require(tidyverse)
require(DiagrammeR)
require(poppr)
require(genepop)
require(graph4lg)
require(related)
require(adegenet)
require(knitr)
require(magrittr)
This is document is an R notebook. If you’d like view to pre-rendered figures, read a summary of analysis and interact with code, please open the relevant html file in a browser.
To conduct a similar analyses on your computer, edit or run code: clone this repository into a directory on you r local machine and open the .Rproj file in Rstudio.
This repository is available at https://github.com/david-dayan/coastal_chinook_75bp_test
This notebook details panel quality control for the 75bp Chinook GTseq panel.
Specific Goals:
(1) Look for failed primers (extremely low reads across all samples)
(2) look for read hogs (over amplifying primers)
(3) primer specificity
(4) probe specificity
(5) dimer formation
(6) New Markers
With respect to goal (6), the input data includes samples prepared with and without the additional of several new test markers tagging the RoSA on Ots28. We will compare the panel performance with and without these markers.
Approach
Goals above can mostly be accomplished by pooling all the reads from the test run into a single fastq, hashing the file, then running some included optimization scripts from the gtseq pipeline. I split the data into two sets: samples prepared with the Ots355 primers and those prepared with the Ots355+RoSA primers and run the analyses below on each.
GTseq_SeqTest
The GTseq_SeqTest script searches for unique (exact) occurences of each primer and probe, and when both occur together. This can be used to identify low primer specificity (high ratio of probe:primer) or low probe specificity (high ratio of primer:probe). We can also look at the distribution of reads to identify any read hogs, or estimate if any poorly amplifying markers can be resuced with the increased depth from removing read hogs/reducing their primer concentration. I couldn’t get this script to run correctly, after a few tries, so instead wrote a script employing awk and seqkit to achieve a similar goal (count primer and probe sequences).
Genotyping
We can run the genotyping script and then look for paralogs, and adjust allele correction values or change in silico probe length to increase specificity to avoid paralogs. Similarly, we can look for loci with non-normal distributions of reads across individuals to attempt to identify if primer or probe sequences occur over common variants. We can “rescue” these markers by shortening the in-silico primer or probe sequences to avoid the variant, or by including base ambiguity into the primer/probe sequences, but then this new in silico primer/probe needs to be tested again for specificity.
GTseq_Primer-Interaction-Test
This script looks for the reverse complement of the reverse primer sequence in off target reads, thereby identifying reads that formed from primer-dimers. it runs off the hash file.
Samples
Two sets of samples were independently prepared, set A (OtsCoastal2020_PT1) and set B (OtsCoastal2020_PT3). Populations and sample sizes are summarized in the table below
#read metadata
meta_data <- read_csv("metadata/CoastalChinook_Run017_Metadata.csv")
#clean this up
meta_data %<>%
mutate(DateSampled = as.Date(DateSampled, format = "%m/%d/%Y")) %>%
rename(sample = `Individual Name`, date = DateSampled, sex = Gender, marks = Marks, location = IndividualSampleLOCATION ) %>%
mutate(pop = str_sub(Pedigree, 8, 11)) %>%
select(sample, pop, date, sex, marks, location)
#summarize
kable(meta_data %>%
group_by(pop) %>%
tally() %>%
ungroup())
| pop | n |
|---|---|
| NESR | 30 |
| SILR | 302 |
| TRAR | 48 |
Panels
Each set of samples was run twice: once with the new 75bp Chinook panel (referred to as OtsGTseqV6.1 at CRITFC) and once with this panel + additional Ots28 markers tagging the RoSA developed by SWFSC.
Primers from OtsGTseqV6.1 were sent in an aliquot and contained all 351 markers. The RoSA markers include 7 markers individually spiked into library prep, however 2 were excluded: Ots37124-12270118 interacted with Ots_103122-180 (from OtsGTseqV6.1) in previous tests, however, the primers from OtsGTseqV6.1 were sent in a pre-mixed aliquot so Ots_103122-180 could not be excluded, and Ots37124-12270118 was excluded from the spike-in instead. The second excluded RoSA marker was Ots37124_12279142, as it is a duplicate with a marker in OtsGTseqV6.1.
The total panel size for the OtsGTseqV6.1 + RoSA markers then is 356 (351 + 7 - 2). For short i refer to OtsGTseqV6.1 panel alone as Ots351, and OtsGTseqV6.1 + RoSA as Ots356
Panels are split by library prep plate. OtsGTseqV6.1 panel alone (Ots351) plates are plate11 (i7 sequence:CGGATG) and plate13 (i7 sequence:AACTTA). OtsGTseqV6.1+RoSA panel (Ots356) plates are plate 12 (i7 sequence:AATGTC) and 14 (i7 sequence:ACCGGA).
Sequencing data
Did not examine read quality for this project. Reads are available at ../../../dayan/coastal_chinook_2020/75bp_panel_qc/raw_reads
First we separate the sample fastqs into those using OtsGTseqV6.1 primers form those using OtsGTseqV6.1+RoSA primers. This section deals with only samples genotyped at the OtsGTseqV6.1 markers, for OtsGTseqV6.1+RoSA marker samples see the relevant section below.
First let’s split into Ots351 and Ots356 files.
# separate reads into groups by i7 sequence
# the four i7 sequences used here are not also used as i5 sequences, so we can do this with some simple filename matching
#from raw_seqs dir
mkdir Ots351
mkdir Ots356
find ./ -iname "*CGGATG*" -print0 | xargs -0 -I {} mv {} Ots351
find ./ -iname "*AACTTA*" -print0 | xargs -0 -I {} mv {} Ots351
find ./ -iname "*AATGTC*" -print0 | xargs -0 -I {} mv {} Ots356
find ./ -iname "*ACCGGA*" -print0 | xargs -0 -I {} mv {} Ots356
#also let's move the late-summer run rogue steelhead files (in same sequencing run)
find ./ -iname "*CGGATG*" -print0 | xargs -0 -I {} mv {} Ots351
Now we’ll decompress these files
#note this is run in each directory
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-194
#$ -tc 30
#$ -N decompress_fastqs
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err
FASTQS=(`ls -1 *fastq.gz`)
INFILE=${FASTQS[$SGE_TASK_ID -1]}
gunzip -c $INFILE > ./${INFILE%.gz}
#save this code chunk as a file on the server and submit this with qsub -q harold scriptname from the directory you want the output .genos files
Finally we’ll build the hash files which we will use throughout this notebook
# in each raw_reads subdirectory
# first merge all sample reads into a single file
cat *fastq > ../merged_Ots351.fasta
cat *fastq > ../merged_Ots356.fasta
#run hash (did with interactive shell (qrsh))
perl /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_HashSeqs.pl merged_Ots351.fasta > merged_Ots351_hash.txt
perl /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_HashSeqs.pl merged_Ots356.fasta > merged_Ots356_hash.txt
# rm the merged files (very big and same data already copied many times)
The GTSeqTest script (from the gtseq pipeline) performs an exact matching search among the hash of reads from the test panel for (1) primer sequences, (2) probe sequences and (3) both 1 and 2 together in the same read. I have had some strange results when attempting to use this script in the past so wrote a different approach using seqkit and awk.
First some prep. Need to make a list of probe sequences, with reverse complements.
# Collect the primer and probe sequences from the OtsGTseqv6.1 summary files and saved as Ots351_probeseqs.txt
panel_info <- readxl::read_xlsx("metadata/Ots Loci Information. BPA. IDT. PROBEseq 1Feb2021.xlsx", sheet = 2)
Ots351_probes <- select(panel_info, `A1-Probe`, `A2-Probe`, `FWD Primer`)
Ots351_probes %<>%
mutate(probe1 = case_when(
str_detect(`A1-Probe`, pattern = "\\[") == TRUE ~ str_replace(`A1-Probe`, "([ATCG]*)\\[([ATCG])([ATCG])\\]([ATCG]*)", "\\1\\2\\4,\\1\\3\\4" ),
TRUE ~ `A1-Probe`)) %>%
mutate(probe2 = case_when(
str_detect(`A2-Probe`, pattern = "\\[") == TRUE ~ str_replace(`A2-Probe`, "([ATCG]*)\\[([ATCG])([ATCG])\\]([ATCG]*)", "\\1\\2\\4,\\1\\3\\4" ),
TRUE ~ `A2-Probe`))
#this (above) broke for a few markers when there were multiple base ambiguities in the probes, went in a text editor and just did these manually
#write_tsv(Ots351_probes, file = "metadata/Ots351_probes_for_awk.txt")
Ots351_probes <- read_tsv("metadata/Ots351_probes_for_awk.txt")
#reverse complement the probe sequences and combine into string in new column
Ots351_probes %<>%
separate(probe1, c("probe1a","probe1b","probe1c","probe1d")) %>%
separate(probe2, c("probe2a","probe2b","probe2c","probe2d")) %>%
mutate(across(.cols=starts_with("probe"), .fns = insect::rc, .names = "{.col}_rc"))
Ots351_probes <- Ots351_probes %>%
unite("probes_for_awk", probe1a:probe2d_rc, sep = ",", na.rm = TRUE, remove = FALSE)
#write_tsv(Ots351_probes, file = "metadata/Ots351_probes_for_awk2.txt")
#use this infor to build the functions below
Now we count primer hits in the hash file
#!/bin/bash
#$ -S /bin/bash
#$ -N primer_count
#$ -cwd
#$ -o $JOB_NAME.out
#$ -e $JOB_NAME.err
PRIMERS="GGTCTTGCAGTCAGGAGAGG
AACCCTATGGGAACTCGTAGAACT
TGCCGCTGGATTTATTGACA
GTATAAACTAGAGTCCAGTGTTATGTTAATGTCTT
GCATTACTAAAAACTGGTGTGTGGAA
GCCTACCAGAAAGTACCAATTGTGA
GCCCTGGAGAAGTACGTTTTAAACTAA
CTGCATGAACGTTAACTCAAATAAAAGGT
TGCAGGAAGAGTTCAGAGAAATCT
AGACAATCATGGTGTTTTGAGTCTTTCT
CAAAGTGCAGGTGCTGGC
TGCAGGCATCATGCTTAATAACT
TGCAGTGCTGAATTAGAGATTAATTTTTGTG
CATTCCATGACAATGATTGAAATCTAAAAACAC
CGTGGAGTAGGTGGTTACAGTTTAT
GGATGGTTGTCATTTCTCTGCAAA
GGTGATTTTGCCACAGAGTAGAGAT
CACTAAATATTCCTTATCATTTCATACTAAGTCTGAAGAA
CGCGAGTTAGCTCGAATATTATGATTTC
GCAGGGCGCAAAGTTCTT
CAGATGAAAAATAAATAATTGGGCCATTAGGAA
GCATGTAACACATTATTTGGCATATGTACT
CCTTTTTCTTATTAGTTTTACTTCCCCAGAGA
GAAAAAGTAAAGTAAAAGTAAAGTATTATACCACTAAAGACAAT
GCCATGGAGGACTGGATGA
CAAGGGATGTGACAAATTAATCAAACACATAA
GTTTGGCTATTGAAATTATACATTAAAACATGTAGCT
CCGTTGCAGGACTCATCAGT
TGTCATCTCTATTGCAATCTCAGTAGATTTCTAT
TGCCATCATAAACAACCTAACAAGTAACT
TGGAGAACTTGCACTGAATGTGAAA
GCCTACTGATAAATGTATGACAGTAATGGA
CAGGTCTGGTCTACATCGAACAC
TCAAAGACATCGAACACAAGAACGA
AGTTGTTCTTTTTATATTGTGTTTTTATTCCATTCCA
GTTGGCTCCTTCAATTCAATTTGGA
AGCTAGGCTGTAAATGCAAGGAT
TTGTGGAATTACACCTTCAGAGTTCAAT
CATTTCCACGAAAAGCCAGATGAC
TTTTCTAGGACAGGTTGCTTGCA
GTCAACAAATGCAGGTAACATAAATGGT
TGAAATAAATTGTTCTGTTGATATGTGAATTTTGGA
TTTTAAAAATGGAGATAAACTCCTGACCTGAA
TGAGCGAGATTTATCAAACTGTCAAAGA
TGCAGGGTTGGGGACAATT
CAGATGATAGCTTCAGTAAGTGGTTCA
GGTAGGCCGTCAGTGTAAAATAAGT
AGTACAAGTGCAGAGAATGACATCATG
CACTTTTGACTTTACATGGAACTTAACTCAT
TGGGACAGAGGTGGGAATTGA
GTGTGTGTGTGTGTGTGTCATCGT
GTCGTTTTTCATAGAAAATAGCTCACAGTT
TGCAGGTGGGACTTAAACACA
TGCAGGACCAACTTTCTCAT
TGCAGGAGAGCAGGGTAGA
TCTCCTGAACTAATTTAGACCTCTGAATGT
TTTGTGCGTAAAGTCAGGTAGTGT
GCCGAAAAATAAGCGATTAGTGATGA
TGCCACCTCAGTTTTAGTGTTATATCC
CATAGTATAGTGATTCGAGTCTGGAGTCT
CACAGGAAGGACGTGTTTTGATG
GCAGGGGCAGACTGAAGG
GATCATTTATCAAGACTATAGGCTATGGATACG
TGCATCCATTCATACCTGACCAATT
CACTTGTTCTGCACACTACTTGTC
GCCTAGGTATGTACGAAACTTCACA
GGTGAGCCATTCATAACAATCTT
GAGGATGGATGAGACTTTTCAGAT
AAACCACGGTATCCTTTATTCATC
GGCTTGCCTTTAGATAGAATCTTG
GGCTTTCTGATGATCTTGAACTTT
CCCTCCAAAAAGAAAACATTTGAT
AGATGATATGGATTTGCTGTGTGT
AACACAGATCAAATGTTTTCACAC
TGAGGTATTACTTGCTGAGTTTGC
GCGAGTGTTAAAAGGGTCAAA
AGCGAGGCTTGCGTTTTACT
AAGGCCGTGAACATCTGTG
GCGGTGGGATACCTCCTCTA
GCAATTACCCATGACTCTGTGA
CATTTAGCAGACACTCTTATCTTAGTGTCA
GTCACGATCAGCAATGAGACAG
ATAATGTTTCCTGACATGTTGGTG
GAAGAGTCATGTCAGGATTAGCC
CTAGTCCTGACATGTAGGCTCATC
ACTCCTGGACATTCTGTGGTG
AAAATAAACAGAATGGAGTACCATGA
CAGGTCTCTCCAGAGATGTTTGAT
TAGTCCCACTGTACATGCCTTAGA
CCCAAGTGGTGAGTGTCAGT
GACTCAGGTAAGGAAACATCAATGTCA
TCATAAACATGGTGTCTTTCAGTCAGTT
TCGCTAGGCAGAAATATAGGGTTCT
CCTGAACAGGTACACACAAACGA
CTACGCGAGAAATAACACTTTTCAAAACT
GTGGGTAATCGATGCCAAAGAGAT
TGTTTTTGGTCATGTATTTTCTCTGCTATTTTT
CCTGAGCATCCCAGTTGAACT
GCGTTACTGGTGTTATAAACGTTAGC
GTTCTTTTTAATGATGACTACAGGTCTTTCAC
CCTGGTCTGTTTGTGATCAAGATG
CTGAGCTTTTTCAACTTACTTGTTGGA
CTCGCCTCTGTCATTGTATTACCTT
TTGTTCAATGGGCATTAATGCATGTT
GCCCTTGTGACAATGCACTGTTATA
ATCCAAGGAGCCCCATTAAAGATTT
AGAGCATTCAATTTAAAAGCTGAAAACGA
CTGCTGGCGCAGACATG
CATGTTACCCAGCTAAAAGTCTATAGCA
GTTCGTGGGATTGTTCAATGTTCAT
GGCCACTGTCATAGAATTAGGCATT
GCAGGGACAGGGCCCT
GCTCTTGCCCATCTGTAGGAT
CACTGAACTGTAAGCCATTGTGATT
GGAAACCAGCTAGGATTCAGGAA
TGCTTCAGTGAAAATAAGCGTGAGA
AGTCAGTGTTGGTGTAGTGAAGAGA
CAATTACTCTTTCTCAGCCCTGTGT
ATGGGTTGGGATTATGGTTCATTGT
GGAGGTGTAGTGAAATGGGAAGAT
TGCAGGAACTTGCTATGCT
GGAGAACATGCATCACCATTCAAG
TGCAGTTACAAGCCTAAGACAATCT
CGTGTCCGGCTTCTTTTATTTCATT
CCGTCTGAGTAGGAGGATCAATACA
GGAGGCTCTACGTAGGCCT
GAGGTTTGTTACTGTCACCCATAGA
GGGCCACGGGGTTGTAAA
GGACAAGTTGAAACAGATCAGGAAGT
ACTCTCCCAGAAGGATTCAGAGA
GCTGACCACCGACCACAG
AATATTGGCTTTCTGAGAATGCATTTGG
TCTGAACTCACCAAAGGAACACTTG
TGTGTACATCCGCGTAAATATTGAAGATAA
GAGACAAAGGTTTGCAGGTTCATG
TGGTGAGAGCAGCTTTAAATGTCTT
ACTGCAGGCGTCATGCTT
TGATTGTCTCATGGCCAATTGTCA
TGACAGATTTCACCTTTAACTAGCTAAGC
GCAGGGAAAACTGGTCAGGA
TCTTTGATATTGAGCTCATAAAAGCAAGGT
GTCTCTCTCTCTTTGCATCATTACACT
TGCGCGTCTCATTCAACCAT
TGTTGTCTCGGACTGCATGAC
GGATGACTCCTACTAATAGACGGATGT
CTCTTGCTACTTGCAGTGTATCTCA
GGTTGCAGGGCAGAACTGT
CCTATTTTTGATAGGTCATAGTGAATGGGATAG
TGCAGGGACGGGGCT
GCTGTGATTGTGCTCTAAAGACATG
TGCAGGAGCTGTGATGGG
CAGGAACCTGCTTTAATGCTCT
CCTTCAAACTAACACATCATAGACATGCTT
TGCCGTGAGAAACTGGTCA
ACTTCTTGAGCCAATCGGATGATG
CTTTTCTGAATTAGTGCTGTGCTTGT
TGGGATAGAACAGGAGCTTAAACA
GTGACAGGAGACAGAAAGAGACATT
AATGGGTAACAAAGAAATAGCTAGCTACTT
ATGTCAATATATTTCACTATAATGATTGGAAGCCA
CCGACGCCTCACTGAGT
CGGTCATTGTAAATGTCAACGGTTT
GCTGAGGAAGGATTCTGTATTTGCT
ACTCTATCATCGGCAGGACCAT
TCAATGTTCATCAATGCACTTCCTGTA
GCCCTGCCTGCAACTTC
TGTTGTAATCTTTCTGAATATTTGCTTGCTT
CTCACTGCAAATCCAACTTCATCAT
GGTCTGTCTGTCTGTCTATCTGTCAATG
AGTGCAGGTCTCCAGATTTACA
GAGTGGTCAAGGTTTCAGTTTCTG
TTCCCTAATCTGACGTACTACCAACT
CTGGTCTGTGACGTCAAAATGATG
GTGATGAGAGGTTTCCGGAAAATCT
CCAGCAGAGACTGGGTTCAC
GCCTCACATTTTACTGATGTCACTTC
GAAACGTCTATGCTGTCCCCTTTAA
TTCTGGGTTGCCATACTCTTTCAAT
CGGAAGACCAGATTCTCCAAGAGTA
CAACGCGGGAATGGCTTTTAA
GAACTGAGCGGCTGCTG
CGAAATAAGGGCCTGGTGTTTAAAA
AAATGAGGCCGTCCTTTACACT
ACACATGGCTCGTCTGCA
GAGGCCTAATGTCTCTTGTGACT
GGGAGGCAGGCAAAAGGT
CGCTGGGCATGGATGAGT
TGCATGTTTTCTAACTGTGTTTTTGTGT
CTTTTCGGGTTATTCATGCTGTTGT
CTCATACTTTGTACCTGTGTGTTCCA
CCAAATACAGACCAGCTACTTGTGT
ACCAGTACCTAAACGTTAGAAAGCAA
CTCTTGTTTGCTATGGGAGATGTAGT
TAGGAGTTGGAAAGACTGCACA
GCCAAGTGATCAAGTGCTTGT
GAGGCTGACTTGGACTTTGC
CAGGCTTGTGTTAAGTAGGGAGAAA
CATAGATGTTTATATGAAAAACCTCCCACTGT
ATTTTCAAACAGGCATTTATCATTGGTGAA
CCCACTTCCAGAGCCTGAA
CAAGGGCACATTGGCAGATTTT
CGGACAAAGAGCTACAGAAATGC
ATGACCAATTGAAGAGTTCTTCCGT
GGACTCGTGCTTGAGGAAGATG
GTCCTCAGCTGGGTCAAGAG
CTGTTAGTGCAGAAGACGTAGCT
GGCCATCTTTCAGGACGTACAG
GGTGCCACTTTAGTATAGCTGCTTA
ACTGCCACAGACACGAACTC
CGATGTACTGAGGGCAGTGT
TGCACCTGCGAGAGCAT
TGTGTTTAGGATTGAACTGACCATGTT
ACACCCACTTCAACCTCCATAAC
TGTGTATTCGTCGACCGGA
GCCTACAGCAAATTCAGCTACACAT
GCCATATCCCGGGGCTTG
TCATCAAAACATGCCTCTTCTGTGT
CAGCCCGTCCCAAAATCAAG
GGCACAGCGACAGGAGTT
GCATGGCTGCCCTAGAACA
GCCAGATAGTAGCGTACATCATGAG
GGCACTCTCCCTGGCTAGA
GAGGATGACACTGTCCGTTTGT
GCAGGGTCTGTGTGGGTT
TGCAGGAGGAGGAAGGCA
AACAAAGAATGTTAAACACCAAACAGGAA
CTGCTTGTAGCCGTTCAGC
CTCAACAGTGCACCTCCCTTAATT
GCCATTTGACCAACGGAGC
TCTTTGGACTGTGTATACCAGGTGTA
CACCTTAGTTCCACGCAACATG
CCCCATATGAGACGCTACAGTAATG
ACAGAGCTGTGTCTACCAGA
GCCTGCCCTACTTATCTCTTATCA
CCCCAAAAACATCAAGAAGTCTAA
TGCAATATAGAACAAATCCGAAAA
TTTATATTCAGACATTCGCCAAAA
CCAATGGTGATTTTAGAACCATTAC
TGTGCGGAATTACTGATAATTGAC
CAAATGTAAGGATACGCTTGAATG
CTTATCTCAAAGGAATGGGAATGA
TCAAATTGAATGTAGACAGATGGAA
AACCATTGTTCTTGTATTCCTGCT
ACAACTAGTCATCGTGGAATCTGA
GAGAGAGTGCATTCTTCATCAAGTT
CTCTTTCAGTTGTCTTTGCTCTTG
TCTCAATGTGATTGAAATGGATGT
CATGAGACACCCTGGAGAAAA
GCACTGTATACAAAATCGTGTGGT
GCTAAATGTAAATCGAGTGGCTGT
GTTTTGCCAGAGAGAATGTACAAA
CCCACCATACAATAAAGGCATGT
AGCGCCTGTTTTACATAAACACTT
GTATGAGTTGTGTGGTTGCAATGT
GTAAAGAAACATGACCTTTTCTGAG
GTGCATATTTTACGTGGTTGAAGT
GCTGCTATTTCCGACCTTACAATA
CTTCCAGGAGGTATTGTTGGTTAT
AGAACCCATGCTTTCAGTACACTT
GTTCATTTTGAAATAACTGCATCG
AAATCACCCCATTTCTTTTGTG
TTATTTTGGGCTTCATATGGTTCT
AGATCAAGCTTGCTGACTTCG
CACAAATGTGACCGTTTTCATC
TGATATATTTTGCTGCAATGATCTG
AAGGAGCAGGAGATGTTATTGAAG
CTGACAAAAGTGATCTGCCTGA
GCTTAAACAGCTGCTATTAGGACA
GGGAGACTTAAAACAACCTCAAAA
TGATTTGACTTTTTGTGGTGTTTT
ATTAGTGCATATGAATCGGGCTAT
GAACTATCCTGACTCCCATTGAAA
ATAGAGCTTTTGGTGTTTCATTCC
TGCGAGATTTATCTACTTGTCCAG
CCTAAGAGGAGACGAGCATTACAG
AGAAAGCCATCATCATGAGACC
TCAAGACTGTGCTGTAGTTGTCTAC
CGCAAGTCAGCAGGGTGA
AAAATATGTGCAACATCCAATGTC
ATTATTCAAACAGAGATGGCGAAA
GGTTTGAGCCAATCAGTTGTGTT
GGCTCGAACCACCCAGTTTA
CAAACGCGCACTCACACACA
CCAGAGGTTAGATGGCCCTTT
CCTGAACAAATACTTAACGCTCCAGTT
GTGCTGCAGGAACCATGTG
AAGGTCTACTCCGGTTGTATTCGGT
CCATTCCCATCGGCATCGT
CAGGCAGTCACTGAGTCCG
CCTCTGCTGAGTTTGAGGGG
CAAGAACACCGAGATCTCCTTCA
TAACCATGACTTCTATCAATCACCCC
CCACTGGCTGTGGAGCTT
CAATGTCTAAAGTAATGGTGGTATTCTTGC
CCTTTGGGTCTGCTTGAGGTT
ATCGAGGATGCCTCAAAGACATC
GACTGTCTTGGAACCGTTGCTA
TGGGACCCACATAAAGCAACTG
CAGATGGTGCAGGCCGAA
CCAGCCCCGTAACACACAT
GCACCGTATCAACGAGCTCAT
CCTTAGCTGCTCTTTGAAGTTGACT
GGTGATAACAGGTGTTGCACCAA
GCGTACTGAGCCTGGATGACA
CCTCCAGATGAGACCCACTCT
CAGACAGGTCACCATCACACT
TGCCTAAACACTCCCAAGGT
GTCGATTACCGTTAGCTTCATCCT
GCAGGAAGCAAAGTTCGGTG
CAGTTCGCTTCTCCAGGGA
CTGCTCACCTGCATCAGTGT
ACTCTGGGTCCAGGAGGTTTT
TGAAAGATATCAATTGTAGTAGTGGTGGTG
TCGTGGATTGTGGCTTACGT
GAATGCAGGGCCAGGGAG
CAAATCAGAACAAAACCTCCCACAA
TCTCCCTCATTCCCATGTCATATCA
CTCCCCCCTGGACTTTGG
CGTGGTGTTCGCCTTCCT
TTTGAGTGAGTCACTGCACCAA
CCTGCTCTGTGTCTGGGC
CCGCCTTTCCCACCTTCTC
CTCTGCCATTCATTTGGGCTTTG
AGGTCCTCTGTCGCACCTA
TTTCTCATCCTTCTCTCTTCCAGTCT
CGCAATGAGCCAACCCCT
GGAACTTCCTCTCCCGTTCTG
CCATACAGCCAGTCCAGGTG
CCAAATCCTCATCCCACACACT
CTCCCTGTTCGCTAGCCG
ACCACCCACCTCCTCAGA
GGGAGAGGGAGACGTGGA
ATCAGGTCTGGGGCGACA
ACTTTGAGGACTTACTCCTGTCCT
GTTTTGGTGTGGTCTCAAATCC
ACCTTTTAGCCAGTGACAACATTT
TTTTGGAACCCTTTTTACTACGAG
ATTTGCTGTGTGTGGAGTGAAT
CAGCAGCTGTTTATGACTGACTTC
TTAGCAGGCGATCTAATTCTGATT
ATCATCTCTGCTCAGAGGCTATTC
CAATTGTAGCCCTCTAACTTTTCC
TATACCTTTGTAGCATCCCTCTCC
TACGGTAGGAAGACTGAATGAGTG
AGTGCTCCATGCTGGAGTTT
CAGTTTAAGTGTTACCACCACGAG
TTAAATCACCCAGAGCTTGTTAGA
CATTTCAAAATTAGGAGGTTAGGG
ATGGTTAAATTGACTCCTCCCTAT
CACATGGCTCTTTGCTCAAAAT
CAGTTCCTGACATTCACCAAAATA"
for i in $PRIMERS
do
cat merged_Ots351_hash.txt | /dfs/Omalley_Lab/dayan/software/seqkit grep -s -i -p $i | grep ">" | awk -F ";" '{sum += $3} END {print sum}' >> Ots351_primer_cts.txt
done
#save this file as a script and submit with qsub
Now we count probe sequences
#!/bin/bash
#$ -S /bin/bash
#$ -N probe_count
#$ -cwd
#$ -o $JOB_NAME.out
#$ -e $JOB_NAME.err
PROBES="TCAGCGAAGTGGAGAT,ATCTCCACTTCGCTGA
ATTAGAACTCGTAGAACTAT,ATATTAGAACTCGTATAACTAT,ATAGTTCTACGAGTTCTAAT,ATAGTTATACGAGTTCTAATAT
TAGCTCCGAGCTAA,TAGCTCTGAGCTAA,TTAGCTCGGAGCTA,TTAGCTCAGAGCTA
CACCAATCAATTAATTATT,ACCAATCAATTCATTATT,AATAATTAATTGATTGGTG,AATAATGAATTGATTGGT
TGACCTGAAAATATATATTTTT,TGACCTGAAAATACATATTTTT,ACCTGAAAATATATTTTTTT,ACCTGAAAATACATTTTTTT,AAAAATATATATTTTCAGGTCA,AAAAATATGTATTTTCAGGTCA,AAAAAAATATATTTTCAGGT,AAAAAAATGTATTTTCAGGT
CCGACACAATTTTGT,CCAACACAATTTTGT,CCGACATAATTTTGT,CCAACATAATTTTGT,ACAAAATTGTGTCGG,ACAAAATTGTGTTGG,ACAAAATTATGTCGG,ACAAAATTATGTTGG
CTAGGTGAAACTTTTTTTAAA,CTAGGTGAAACTTTTTAAAAA,TTTAAAAAAAGTTTCACCTAG,TTTTTAAAAAGTTTCACCTAG
TGTGAGGGCGGTCTT,ATGTGAGGTCGGTCTT,AAGACCGCCCTCACA,AAGACCGACCTCACAT
TTTAAGATGTAGTT,TTTAAGGTGTAGTT,AACTACATCTTAAA,AACTACACCTTAAA
CCGCTTGAAAGTTTGA,CGCTTGAAGGTTTGA,TCAAACTTTCAAGCGG,TCAAACCTTCAAGCG
CTGCCACCCTTTGA,CTGCCATCCTTTGA,TCAAAGGGTGGCAG,TCAAAGGATGGCAG
ATAAAGTGTGTTAT,ATAAAGCGTGTTAT,ATAACACACTTTAT,ATAACACGCTTTAT
CAAAGAAAATCAAAATTT,CAAAGAAAATCTAAATTT,AAATTTTGATTTTCTTTG,AAATTTAGATTTTCTTTG
CTGTATACAGTAAGAGTATTAAT,ACAGTAAGAGCATTAAT,ATTAATACTCTTACTGTATACAG,ATTAATGCTCTTACTGT
CAAAGTCAAAGATCCTATTAAA,AAGTCAAAGATCGTATTAAA,TTTAATAGGATCTTTGACTTTG,TTTAATACGATCTTTGACTT
AGAAAGTTCTAGAAATAATT,AAAGTTCTAGGAATAATT,AATTATTTCTAGAACTTTCT,AATTATTCCTAGAACTTT
CCATCCTGTCTTGTCTG,CATCCTGTCATGTCTG,CAGACAAGACAGGATGG,CAGACATGACAGGATG
CCACTACTTAACGTGCTTT,CCACTACTTAACATGCTTT,AAAGCACGTTAAGTAGTGG,AAAGCATGTTAAGTAGTGG
ACCATTTGATATAACTGCGTTAG,CATTTGATATAACGGCGTTAG,CTAACGCAGTTATATCAAATGGT,CTAACGCCGTTATATCAAATG
TTATGGCTATTATT,TTATGGTTATTATT,AATAATAGCCATAA,AATAATAACCATAA
CAATGAATACAATATCTAACCTAAT,AATGAATACAATATCTAATCTAAT,ATTAGGTTAGATATTGTATTCATTG,ATTAGATTAGATATTGTATTCATT
ATGCATTCACCTGTATTAT,TGCATTCACCAGTATTAT,ATAATACAGGTGAATGCAT,ATAATACTGGTGAATGCA
AAACAAACAACGCCTCATG,AACAAACAACACCTCATG,CATGAGGCGTTGTTTGTTT,CATGAGGTGTTGTTTGTT
TTAGTCAACTGTTGTTTTT,TTAGTCAACTGTTATTTTT,AAAAACAACAGTTGACTAA,AAAAATAACAGTTGACTAA
TGTTGGCGAAGTGGGT,TGTTGGCGAAGTGTGT,TGTTGGTGAAGTGGGT,TGTTGGTGAAGTGTGT,ACCCACTTCGCCAACA,ACACACTTCGCCAACA,ACCCACTTCACCAACA,ACACACTTCACCAACA
AGAGCATGTAGTTTTG,AGCATGTAATTTTG,CAAAACTACATGCTCT,CAAAATTACATGCT
TTTTAAAA,CTGGCATCCA,TTTTTAAACTGGCATCCA,TTTTAAAA,TGGATGCCAG,TGGATGCCAGTTTAAAAA
AGCTAGCGCTCCTC,AGCTAGTGCTCCTC,GAGGAGCGCTAGCT,GAGGAGCACTAGCT
AATGTCATATAGAAATCTAC,AATGTCATAGAAATCTACTG,GTAGATTTCTATATGACATT,CAGTAGATTTCTATGACATT
TACAGGAGATAAGGTCGCA,CAGGAGATAGGGTCGCA,TGCGACCTTATCTCCTGTA,TGCGACCCTATCTCCTG
TCTCTTATCTGAGTTCTGC,CTCTTATCTGTGTTCTGC,GCAGAACTCAGATAAGAGA,GCAGAACACAGATAAGAG
CACATAGTGTAGCTTTACTAC,CACATAGTGTAGCTCTACTAC,GTAGTAAAGCTACACTATGTG,GTAGTAGAGCTACACTATGTG
AAAACAAATCATTTTTCG,AAAAACAAATAATTTTTCG,CGAAAAATGATTTGTTTT,CGAAAAATTATTTGTTTTT
ATGTACTTTAACGATTCATTT,ATGTACTTTAACGTTTCATTT,AAATGAATCGTTAAAGTACAT,AAATGAAACGTTAAAGTACAT
CCACCGCCATCTGATA,CACCGCCGTCTGATA,TATCAGATGGCGGTGG,TATCAGACGGCGGTG
ACGGAACAAATAAGACATTT,CGGAACAAATAAGCCATTT,AAATGTCTTATTTGTTCCGT,AAATGGCTTATTTGTTCCG
TAACACATGTTGGAGGTC,AACACATGTTAGAGGTC,GACCTCCAACATGTGTTA,GACCTCTAACATGTGTT
TCCATGGAAACGGACAAT,TCCATGGTAACGGACAAT,TCCATGGAAACTGACAAT,TCCATGGTAACTGACAAT,ATTGTCCGTTTCCATGGA,ATTGTCCGTTACCATGGA,ATTGTCAGTTTCCATGGA,ATTGTCAGTTACCATGGA
CTCTACAGTATG,CTCTACAATATG,CATACTGTAGAG,CATATTGTAGAG
TGCCACATGATAATTGA,CCACATGGTAATTGA,TCAATTATCATGTGGCA,TCAATTACCATGTGG
ATCAGTGACATAAGTTGTCCA,TCAGTGACATAAATTGTCCA,TGGACAACTTATGTCACTGAT,TGGACAATTTATGTCACTGA
ATTGCCCATCTCAGAATA,AATTGCCCATCTTAGAATA,TATTCTGAGATGGGCAAT,TATTCTAAGATGGGCAATT
CAGTGTATTAGTCATTCTTA,CAGTGTATTAGTCGTTCTTA,TAAGAATGACTAATACACTG,TAAGAACGACTAATACACTG
CCCCGAAGTACTTTT,CCCGAAGAACTTTT,AAAAGTACTTCGGGG,AAAAGTTCTTCGGG
CCATTTTAATTCCA,CCATTTGAATTCCA,TGGAATTAAAATGG,TGGAATTCAAATGG
TGTTCCAATGTAAAATGTATGC,TTCCAATGTAAAATATATGC,GCATACATTTTACATTGGAACA,GCATATATTTTACATTGGAA
CTGCCTAGTTAAATAAAATA,CTGCCTAGTTAAATTAAATA,TATTTTATTTAACTAGGCAG,TATTTAATTTAACTAGGCAG
CCGAGCTTGAGTTAGGA,CCGAGCTTGACTTAGGA,TCCTAACTCAAGCTCGG,TCCTAAGTCAAGCTCGG
AGCTAGTGCTTAGCAGCTAA,AGCTAGTGCTTAGCAGCTAC,AGCTAGTGCATAGCAGCTAA,AGCTAGTGCATAGCAGCTAC,TTAGCTGCTAAGCACTAGCT,GTAGCTGCTAAGCACTAGCT,TTAGCTGCTATGCACTAGCT,GTAGCTGCTATGCACTAGCT
AGGGACAGTTTCGCAGACG,AAGGGACAGTTTCTCAGACG,CGTCTGCGAAACTGTCCCT,CGTCTGAGAAACTGTCCCTT
TCATCTTTTGTTATTTCCTTG,ATCTTTTGTTCTTTCCTTG,CAAGGAAATAACAAAAGATGA,CAAGGAAAGAACAAAAGAT
ATTTGACTTGTCTTTTT,ATTTGACTTGTGTTTTT,AAAAAGACAAGTCAAAT,AAAAACACAAGTCAAAT
CACACACATGCACG,CACACATATGCACG,CGTGCATGTGTGTG,CGTGCATATGTGTG
AACCTGTGTGATTT,AACCTGCGTGATTT,AAATCACACAGGTT,AAATCACGCAGGTT
ATGATAAT,ATGATATT,ATTATCAT,AATATCAT
CTGAATGTTTTTTTTAATCTTT,CTGAATGTTTTTTTTTATCTTT,AAAGATTAAAAAAAACATTCAG,AAAGATAAAAAAAAACATTCAG
CTGTGTGTCTAAGACAAT,CTGTGTGTCTATGACAAT,ATTGTCTTAGACACACAG,ATTGTCATAGACACACAG
AATTGCCTCATTGGGTG,AATTGCCTCATTAGGTG,CACCCAATGAGGCAATT,CACCTAATGAGGCAATT
AAACCATTTTCATTCTTTTG,CCATTTTCACTCTTTTG,CAAAAGAATGAAAATGGTTT,CAAAAGAGTGAAAATGG
GAGACTGTTGAGAC,GAGACTATTGAGAC,GTCTCAACAGTCTC,GTCTCAATAGTCTC
ATAACATCTGCAGCATTAA,ATAACATGTGCAGCATTAA,TTAATGCTGCAGATGTTAT,TTAATGCTGCACATGTTAT
GTTTGGCATAAAGT,GTTTGGAATAAAGT,ACTTTATGCCAAAC,ACTTTATTCCAAAC
CCTTAAGCATATTTCT,CCTTAAGCGTATTTCT,AGAAATATGCTTAAGG,AGAAATACGCTTAAGG
TTTCCAATGGTATAGATATGA,TTTCCAATGATATAGATATGA,TCATATCTATACCATTGGAAA,TCATATCTATATCATTGGAAA
TGGGAAGCAATCAA,AATTGGGAAGCAGTCAA,TTGATTGCTTCCCA,TTGACTGCTTCCCAATT
CATAGACAGGGGCCAT,CATAGACGGGGGCCAT,CATACACAGGGGCCAT,CATACACGGGGGCCAT,ATGGCCCCTGTCTATG,ATGGCCCCCGTCTATG,ATGGCCCCTGTGTATG,ATGGCCCCCGTGTATG
ACATTACTTTTCAAAAATATT,ACATTACTTTACAAAAATATT,AATATTTTTGAAAAGTAATGT,AATATTTTTGTAAAGTAATGT
ATAAAAAATTCTGCGTGAATG,ATAAAAAATTATGCGTGAATG,CATTCACGCAGAATTTTTTAT,CATTCACGCATAATTTTTTAT
AATATATTTTTTATAGGC,AATATATATTTTATAGGC,GCCTATAAAAAATATATT,GCCTATAAAATATATATT
AAAGCTGATTAAAAA,AAAGCTGATTTAAAA,TTTTTAATCAGCTTT,TTTTAAATCAGCTTT
TAAAAATGGTTGATATGTA,TAAAAATGATTGATATGTA,TACATATCAACCATTTTTA,TACATATCAATCATTTTTA
GACACACTCACGA,GACACACTCATGA,TCGTGAGTGTGTC,TCATGAGTGTGTC
TTCTCAAGTCCTACTCAACTG,TTCTCAAGTCGTACTCAACTG,CAGTTGAGTAGGACTTGAGAA,CAGTTGAGTACGACTTGAGAA
GTGATAGTTTGATAGTTTTAT,GTGATAGTTTCATAGTTTTAT,ATAAAACTATCAAACTATCAC,ATAAAACTATGAAACTATCAC
CTGAATCCTGTAAG,CTGCATCCTGTAAG,CTTACAGGATTCAG,CTTACAGGATGCAG
GGAGATAGTCAGGG,GGAAATAGTCAGGG,CCCTGACTATCTCC,CCCTGACTATTTCC
CCGCAACAGATC,CTGCAACAGATC,GATCTGTTGCGG,GATCTGTTGCAG
CCTCACATACTCCCTT,CCTCATATACTCCCTT,AAGGGAGTATGTGAGG,AAGGGAGTATATGAGG
GGCGGCTCGGAAAATTATTTT,GGCGGCTCGGGAAATTATTTT,AAAATAATTTTCCGAGCCGCC,AAAATAATTTCCCGAGCCGCC
ATTGTGCTTATCACA,ATTGTGCTTAGCACA,TGTGATAAGCACAAT,TGTGCTAAGCACAAT
AATAGGCCGACATCAA,AAATAGGCCAACATCAA,TTGATGTCGGCCTATT,TTGATGTTGGCCTATTT
CTGTAGTGACGCCGCAACAC,CTGTAGTGACACCGCAACAC,GTGTTGCGGCGTCACTACAG,GTGTTGCGGTGTCACTACAG
GCATAGTTTGG,GCATGGTTTGG,CCAAACTATGC,CCAAACCATGC
GTAAAGCAAA,GTAACGCAAA,TTTGCTTTAC,TTTGCGTTAC
GCCACACTGTT,GCCATACTGTT,AACAGTGTGGC,AACAGTATGGC
TGAGACACAGTAAACCTTCTT,TGAGACACAGCAAACCTTCTT,AAGAAGGTTTACTGTGTCTCA,AAGAAGGTTTGCTGTGTCTCA
TCAGCTATGTAG,TTAGCTATGTAG,CTACATAGCTGA,CTACATAGCTAA
TCCGGGCTCTGGCTGGGGACA,TCCGGGCTCTTGCTGGGGACA,TGTCCCCAGCCAGAGCCCGGA,TGTCCCCAGCAAGAGCCCGGA
CCCACACATGTGGTGACCTCA,CCCACACATGCGGTGACCTCA,TGAGGTCACCACATGTGTGGG,TGAGGTCACCGCATGTGTGGG
CCCTATTCTCCAATCCATAT,CCCTGTTCTCCAATCCATAT,CCCTATTCTCTAATCCATATG,CCCTGTTCTCTAATCCATATG,ATATGGATTGGAGAATAGGG,ATATGGATTGGAGAACAGGG,CATATGGATTAGAGAATAGGG,CATATGGATTAGAGAACAGGG
AACCAGTAGAATAACC,CAGTGGAATAACC,GGTTATTCTACTGGTT,GGTTATTCCACTG
CAATCTATCATCGACCAGC,CAATCTATCATCAACCAGC,GCTGGTCGATGATAGATTG,GCTGGTTGATGATAGATTG
TGCTAAATGGCATATATTAT,CTAAATGGCACATATTAT,ATAATATATGCCATTTAGCA,ATAATATGTGCCATTTAG
CACTCTTTATATCCACACCG,CACTCTTTATATCCACACCA,CAGTCTTTATATCCACACCG,CAGTCTTTATATCCACACCA,CGGTGTGGATATAAAGAGTG,TGGTGTGGATATAAAGAGTG,CGGTGTGGATATAAAGACTG,TGGTGTGGATATAAAGACTG
TCTTGCAATCATTTTTAAC,CTTGCAATCATATTTAAC,GTTAAAAATGATTGCAAGA,GTTAAATATGATTGCAAG
AAGCGACTTGATTATC,AGCGACATGATTATC,GATAATCAAGTCGCTT,GATAATCATGTCGCT
ATGTCTGAAATGAAAGCC,AATGTCTGAAATTAAAGCC,GGCTTTCATTTCAGACAT,GGCTTTAATTTCAGACATT
TCCTGAAAAACGACATCC,CTGAAAAACAACATCC,GGATGTCGTTTTTCAGGA,GGATGTTGTTTTTCAG
CTTTCGTCCTTAGCACATAG,CTTTCGTCCTTAACACATAG,CTATGTGCTAAGGACGAAAG,CTATGTGTTAAGGACGAAAG
TATCTGGGCGGGCTG,CTATCTGGACGGGCTG,CAGCCCGCCCAGATA,CAGCCCGTCCAGATAG
ATGTATTGTTCATTTAATG,TGTATTGTTCGTTTAATG,CATTAAATGAACAATACAT,CATTAAACGAACAATACA
TAGCGCAAACCCCGAACC,CGCAAACACCGAACC,GGTTCGGGGTTTGCGCTA,GGTTCGGTGTTTGCG
AAAATGTACCACATACTTGT,AAATGTACCACATACTCGT,ACAAGTATGTGGTACATTTT,ACGAGTATGTGGTACATTT
AAGAGTCCAGCGTTACTT,AAGAGTCCAGTGTTACTT,AAGTAACGCTGGACTCTT,AAGTAACACTGGACTCTT
AAGTAACGTATCAAATGGC,AAAGTAACGTATCATATGGC,GCCATTTGATACGTTACTT,GCCATATGATACGTTACTTT
CTGTGTTGAATTTAACATAAT,TCTGTGTTGAATTTAACGTAAT,ATTATGTTAAATTCAACACAG,ATTACGTTAAATTCAACACAGA
TGCATTGCTAAGACTTG,TGCTTTGCTAAGACTTG,TGCATTGTTAAGACTTG,TGCTTTGTTAAGACTTG,CAAGTCTTAGCAATGCA,CAAGTCTTAGCAAAGCA,CAAGTCTTAACAATGCA,CAAGTCTTAACAAAGCA
CCGGTTTACCGATTTG,CGGTTTACCAATTTG,CAAATCGGTAAACCGG,CAAATTGGTAAACCG
CTAAAATGTCATGTAAATAT,ACTAAAATGTCATATAAATAT,ATATTTACATGACATTTTAG,ATATTTATATGACATTTTAGT
CCTTGGATGGGA,CCTTGGATAGGA,TCCCATCCAAGG,TCCTATCCAAGG
ATGCTATTAAATGAATATTC,ATGCTATTAAATGACTATTC,GAATATTCATTTAATAGCAT,GAATAGTCATTTAATAGCAT
TAAAAAAATATAAA,TAAAAATATATAAA,TTTATATTTTTTTA,TTTATATATTTTTA
TGTAGCTAATTTTAAGTTCTC,AGCTAATTTTAAATTCTC,GAGAACTTAAAATTAGCTACA,GAGAATTTAAAATTAGCT
CACTACGGTAAGACCAT,CACTACAGTAAGACCAT,ATGGTCTTACCGTAGTG,ATGGTCTTACTGTAGTG
ATATGGTATGTAGAGGCTAGTTA,TATGTAGAGGCAAGTTA,TAACTAGCCTCTACATACCATAT,TAACTTGCCTCTACATA
ACTGTATATGTTACGTTTTC,ACTGTATATGTTAAGTTTTC,GAAAACGTAACATATACAGT,GAAAACTTAACATATACAGT
AGTTACAAGTGGTGTTTCA,ACAAGTGGCGTTTCA,TGAAACACCACTTGTAACT,TGAAACGCCACTTGT
TTTAACAAGAAAATTATACATTTC,CAAGAAAGTTATACATTTC,GAAATGTATAATTTTCTTGTTAAA,GAAATGTATAACTTTCTTG
AGACATGTAGCTATGTAGGTAA,AGACATGTAGCTATCTAGGTAA,TTACCTACATAGCTACATGTCT,TTACCTAGATAGCTACATGTCT
ATGCATAAAAGGTAATTGTG,ATGCATAAAAGGTCATTGTG,CACAATTACCTTTTATGCAT,CACAATGACCTTTTATGCAT
CATTCAAAAAGTAT,CATTCAGAAAGTAT,ATACTTTTTGAATG,ATACTTTCTGAATG
ATGAGAGAGTCTTTCTCTGTT,ATGAGAGAGTCTTTTTCTGTT,AACAGAGAAAGACTCTCTCAT,AACAGAAAAAGACTCTCTCAT
ATTCTTCCTCTACAATTG,ATTCTTCCTCCACAATTG,ATACTTCCTCTACAATTG,ATACTTCCTCCACAATTG,CAATTGTAGAGGAAGAAT,CAATTGTGGAGGAAGAAT,CAATTGTAGAGGAAGTAT,CAATTGTGGAGGAAGTAT
TTGTGCATTTTCCCC,TGTGCATTTCCCCC,GGGGAAAATGCACAA,GGGGGAAATGCACA
CAAACCAGCAAACAT,ACAAACCAGAAAACAT,ATGTTTGCTGGTTTG,ATGTTTTCTGGTTTGT
AATTTGAATGACCA,AATTTGTATGACCA,TGGTCATTCAAATT,TGGTCATACAAATT
CTACTTATGTAGCATTTTAA,CTACTTATGTAGGATTTTAA,TTAAAATGCTACATAAGTAG,TTAAAATCCTACATAAGTAG
CTCTGCCCCTGGAC,CTCTACCCCTGGAC,CTCTGCTCCTGGAC,CTCTACTCCTGGAC,GTCCAGGGGCAGAG,GTCCAGGGGTAGAG,GTCCAGGAGCAGAG,GTCCAGGAGTAGAG
AATCCCTCCTTTTTCC,TCCCTCATTTTTCC,GGAAAAAGGAGGGATT,GGAAAAATGAGGGA
AGAGAGGGGTCAAA,AGAGAGTGGTCAAA,TTTGACCCCTCTCT,TTTGACCACTCTCT
AGACTGGTAAAAGA,AGACTGGTAAAAGT,AGACTGATAAAAGA,AGACTGATAAAAGT,TCTTTTACCAGTCT,ACTTTTACCAGTCT,TCTTTTATCAGTCT,ACTTTTATCAGTCT
TCAAAGATATGATTCAATTAA,AAGATATGGTTCAATTAA,TTAATTGAATCATATCTTTGA,TTAATTGAACCATATCTT
CAACTGAAGAAAATAATATG,CTGAAGAAAAGAATATG,CATATTATTTTCTTCAGTTG,CATATTCTTTTCTTCAG
CAGGTTAGGAATGGTTG,CAGGTTAGGATTGGTTG,CAACCATTCCTAACCTG,CAACCAATCCTAACCTG
CCTGTTATCAGACCCAAAT,CTGTTATCAGCCCCAAAT,ATTTGGGTCTGATAACAGG,ATTTGGGGCTGATAACAG
TCCCAAAGTCGAGTGTG,CCCAAAGTCAAGTGTG,CACACTCGACTTTGGGA,CACACTTGACTTTGGG
GTATATTTAGAATG,GTATATGTAGAATG,CATTCTAAATATAC,CATTCTACATATAC
TGTTCGAGAATGAAGATGAGTAA,TCGAGAATGAAGGTGAGTAA,TTACTCATCTTCATTCTCGAACA,TTACTCACCTTCATTCTCGA
AAGGCTTTGGTTGTTTG,AAGGCTTTGATTGTTTG,CAAACAACCAAAGCCTT,CAAACAATCAAAGCCTT
GAATGGTGTTAAAT,GAATGGCGTTAAAT,ATTTAACACCATTC,ATTTAACGCCATTC
CATTTACCAGTTCTCACACAC,TTTACCAGTTCACACACAC,GTGTGTGAGAACTGGTAAATG,GTGTGTGTGAACTGGTAAA
TGGTTCCCCAAATTT,TGATGGTTCCCCTAATTT,AAATTTGGGGAACCA,AAATTAGGGGAACCATCA
AGAATGAAGTGAAAAGAA,AGAATGAAGTAAAAAGAA,TTCTTTTCACTTCATTCT,TTCTTTTTACTTCATTCT
CGTATGTGCAATGCATG,CGTATGTGCATTGCATG,CATGCATTGCACATACG,CATGCAATGCACATACG
TTGGTAAACCTGTTTATTGGTA,TGGTAAACCTGTTTTTTGGTA,TACCAATAAACAGGTTTACCAA,TACCAAAAAACAGGTTTACCA
TTCTGTTACTGGAC,CTGTTACTGGGC,GTCCAGTAACAGAA,GCCCAGTAACAG
CCTGCAATACGACCAAC,CTGCAATACAACCAAC,GTTGGTCGTATTGCAGG,GTTGGTTGTATTGCAG
CTATCAAAGCAATACATTG,CTATCAAAGCAGTACATTG,CAATGTATTGCTTTGATAG,CAATGTACTGCTTTGATAG
ACAAATTAATTAAA,ACAAATGAATTAAA,TTTAATTAATTTGT,TTTAATTCATTTGT
CAGTTCTGTAATGCATT,CAGTTTTGTAATGCATT,AATGCATTACAGAACTG,AATGCATTACAAAACTG
AACTGTTCAAACCC,AACTGTCCAAACCC,GGGTTTGAACAGTT,GGGTTTGGACAGTT
TTTCTACTTAGTAA,TTTCTATTTAGTAA,TTACTAAGTAGAAA,TTACTAAATAGAAA
TCTGGCGGATTTACA,CTGGCGGGTTTACA,TGTAAATCCGCCAGA,TGTAAACCCGCCAG
AAGATGGTATGTAT,AAGATGTTATGTAT,ATACATACCATCTT,ATACATAACATCTT
CTTAGACGTCAGAGGTC,CTTAGACGTCCGAGGTC,GACCTCTGACGTCTAAG,GACCTCGGACGTCTAAG
TGTTACGGGACATACT,TCTGTTACGGACATACT,AGTATGTCCCGTAACA,AGTATGTCCGTAACAGA
AGAAGCCCAGCTCC,AGAAGCTCAGCTCC,GGAGCTGGGCTTCT,GGAGCTGAGCTTCT
CAGCACATAACTTGACCTC,AGCACATAACCTGACCTC,GAGGTCAAGTTATGTGCTG,GAGGTCAGGTTATGTGCT
ACTGGGAAGATTGTTTG,CTGGGAAGACTGTTTG,CAAACAATCTTCCCAGT,CAAACAGTCTTCCCAG
CTATAAAGTTGGACAGTTGG,AAAGTTGGGCAGTTGG,CCAACTGTCCAACTTTATAG,CCAACTGCCCAACTTT
TGAGTCCCTGACCAGC,AGTCCCCGACCAGC,GCTGGTCAGGGACTCA,GCTGGTCGGGGACT
CCTGTCTCATTCCC,CTGTCCCATTCCC,GGGAATGAGACAGG,GGGAATGGGACAG
CTTTCCCCGTGTTGGT,ACTTTCCCTGTGTTGGT,ACCAACACGGGGAAAG,ACCAACACAGGGAAAGT
CATTTTTCAGAATTGTATTC,CATTTTTCAGAATTCTATTC,GAATACAATTCTGAAAAATG,GAATAGAATTCTGAAAAATG
CTGGTACCCA,CTGATACCCA,TGGGTACCAG,TGGGTATCAG
CAGTGTCATTTTCGGC,ATCAGTGTCATCTTCGGC,GCCGAAAATGACACTG,GCCGAAGATGACACTGAT
CAGGCGGTTCTCC,CAGGCAGTTCTCC,GGAGAACCGCCTG,GGAGAACTGCCTG
ACCCACCAGTGTCATT,CCACCAGCGTCATT,AATGACACTGGTGGGT,AATGACGCTGGTGG
CCAGTCATGGGTCATT,TCCAGTCATTGGTCATT,AATGACCCATGACTGG,AATGACCAATGACTGGA
CAATCGGAAGTCGG,CAATCGTAAGTCGG,CCGACTTCCGATTG,CCGACTTACGATTG
ACGCTCGGAACATT,ACGCTCTGAACATT,AATGTTCCGAGCGT,AATGTTCAGAGCGT
ACGAGACAGATATTC,ACGAGACTGATATTC,GAATATCTGTCTCGT,GAATATCAGTCTCGT
CTGCAACTCGACGCAAG,ACTGCAACTCTACGCAAG,CTTGCGTCGAGTTGCAG,CTTGCGTAGAGTTGCAGT
ATGGGAGACAGATAACT,ATGGGAGACATATAACT,AGTTATCTGTCTCCCAT,AGTTATATGTCTCCCAT
TGGGGCAACGCACAATTGGCT,TGGGGCGACGCACAATTGGCT,AGCCAATTGTGCGTTGCCCCA,AGCCAATTGTGCGTCGCCCCA
AGCCTAGCTCTCGGAAG,AGCCTAGTTCTCGGAAG,CTTCCGAGAGCTAGGCT,CTTCCGAGAACTAGGCT
CACCACTAGAACTCTC,CACCACTAAAACTCTC,GAGAGTTCTAGTGGTG,GAGAGTTTTAGTGGTG
ATTCTGACAGCTGTTTTG,CTGACAGCCGTTTTG,CAAAACAGCTGTCAGAAT,CAAAACGGCTGTCAG
ATAGAACTACAATTCACATATAT,AACTACAATTCGCATATAT,ATATATGTGAATTGTAGTTCTAT,ATATATGCGAATTGTAGTT
CTACCGTACTGAACTC,CCGTACGGAACTC,GAGTTCAGTACGGTAG,GAGTTCCGTACGG
CCCGGACAAGATGAGACAG,CCCGGACAAGATGAGACCG,CTGTCTCATCTTGTCCGGG,CGGTCTCATCTTGTCCGGG
TGTCCTGTCCTCAGATCA,TCCTGTCCCCAGATCA,TGATCTGAGGACAGGACA,TGATCTGGGGACAGGA
CTGAGATCACTTTGAGCAC,ACTGAGATCACTGAGCAC,GTGCTCAAAGTGATCTCAG,GTGCTCAGTGATCTCAGT
GAACTTAAAACACT,GAACTTGAAACACT,AGTGTTTTAAGTTC,AGTGTTTCAAGTTC
ATTACCAACGGAGAACC,TTACCAACAGAGAACC,GGTTCTCCGTTGGTAAT,GGTTCTCTGTTGGTAA
GGGATGGAGTATTT,GGGAAGGAGTATTT,GGGATGAAGTATTT,GGGAAGAAGTATTT,AAATACTCCATCCC,AAATACTCCTTCCC,AAATACTTCATCCC,AAATACTTCTTCCC
CTACTGTTGTATTTTCTC,CTGTTGTGTTTTCTC,GAGAAAATACAACAGTAG,GAGAAAACACAACAG
AGATCATGGGAATCATAT,ATCATGGGCATCATAT,ATATGATTCCCATGATCT,ATATGATGCCCATGAT
AGACCACAAGATACAGTACC,AGACCACAAGATAGTACC,GGTACTGTATCTTGTGGTCT,GGTACTATCTTGTGGTCT
CCACCATCAAGCACTG,CCACCATCATGCACTG,CAGTGCTTGATGGTGG,CAGTGCATGATGGTGG
ATTCAAAGTCAAATTTT,ATTCAAAGTCTAATTTT,AAAATTTGACTTTGAAT,AAAATTAGACTTTGAAT
CTGCCATGAAGTGCTAG,TGCCATGAAATGCTAG,CTAGCACTTCATGGCAG,CTAGCATTTCATGGCA
ATTTGCGTCTTCTCCC,TTGCGTCCTCTCCC,GGGAGAAGACGCAAAT,GGGAGAGGACGCAA
CGATCTGGACCAGGCT,TGATCTGGACCAGGCT,CGATTTGGACCAGGCT,TGATTTGGACCAGGCT,AGCCTGGTCCAGATCG,AGCCTGGTCCAGATCA,AGCCTGGTCCAAATCG,AGCCTGGTCCAAATCA
CAGAGCATGTGCTG,CAGAGCGTGTGCTG,CAGCACATGCTCTG,CAGCACACGCTCTG
ACAGAAGATTTTCGGCTGC,ACAGAAGATTTTCGACTGC,GCAGCCGAAAATCTTCTGT,GCAGTCGAAAATCTTCTGT
CAGTTTCACTTAATTTTAAAATG,TTTCACTTAATTTAAAAATG,CATTTTAAAATTAAGTGAAACTG,CATTTTTAAATTAAGTGAAA
ACTCACACTCGAGTGACT,ACTCACACTCAAGTGACT,AGTCACTCGAGTGTGAGT,AGTCACTTGAGTGTGAGT
AGAGATGCAAAGTGGAGTT,AGAGATGCAAAATGGAGTT,AACTCCACTTTGCATCTCT,AACTCCATTTTGCATCTCT
ACCGTAGCTGCACCTG,CGTAGCAGCACCTG,CAGGTGCAGCTACGGT,CAGGTGCTGCTACG
CAATGATTAATGATTAATCCTTC,TGATTAATGATTCATCCTTC,GAAGGATTAATCATTAATCATTG,GAAGGATGAATCATTAATCA
CCGCCACCTTGGCT,CGCCACATTGGCT,AGCCAAGGTGGCGG,AGCCAATGTGGCG
TTCATAATTGAACGATTTCA,CATAATTGAACAATTTCA,TGAAATCGTTCAATTATGAA,TGAAATTGTTCAATTATG
TCTGGATGGAACCGTTAG,CTGGATGGAGCCGTTAG,CTAACGGTTCCATCCAGA,CTAACGGCTCCATCCAG
CTGGAGCGTTTCTGTA,CTGGAGCGTGTCTGTA,TACAGAAACGCTCCAG,TACAGACACGCTCCAG
TGGGTCTCGAGCCTGTA,TGGGTCTCGATCCTGTA,TACAGGCTCGAGACCCA,TACAGGATCGAGACCCA
AACGGGCATGAACGACTT,AACTGGCATGAACGACTT,AACGGGCATGAATGACTT,AACTGGCATGAATGACTT,AAGTCGTTCATGCCCGTT,AAGTCGTTCATGCCAGTT,AAGTCATTCATGCCCGTT,AAGTCATTCATGCCAGTT
CCCCCATATTGCTG,CCCCACATTGCTG,CAGCAATATGGGGG,CAGCAATGTGGGG
CCCACTTCGCTGAAGT,CCCACTTCACTGAAGT,ACTTCAGCGAAGTGGG,ACTTCAGTGAAGTGGG
CAAGAGTGGCATAAAA,CAAGAGTGGAATAAAA,TTTTATGCCACTCTTG,TTTTATTCCACTCTTG
CTGTGGTTTGTGGCGTG,CTGTGGTTTGTAGCGTG,CACGCCACAAACCACAG,CACGCTACAAACCACAG
AACATAACGGACTCCC,TAGAACATAACTGACTCCC,GGGAGTCCGTTATGTT,GGGAGTCAGTTATGTTCTA
CTCTCCTGATCACTCTGT,CTCTCCTGATCCCTCTGT,ACAGAGTGATCAGGAGAG,ACAGAGGGATCAGGAGAG
ATTAAACGTCTGGA,ATTAAACGTATGGA,ATTAAATGTCTGGA,ATTAAATGTATGGA,TCCAGACGTTTAAT,TCCATACGTTTAAT,TCCAGACATTTAAT,TCCATACATTTAAT
CATCACAACGATGTGTG,CACATCACAACTATGTGTG,CACACATCGTTGTGATG,CACACATAGTTGTGATGTG
GGGCTTGGGGGCAT,GGGCTTAGGGGCAT,ATGCCCCCAAGCCC,ATGCCCCTAAGCCC
TTTAGACTTTGCTCTATAACAG,ACTTTGCTCCATAACAG,CTGTTATAGAGCAAAGTCTAAA,CTGTTATGGAGCAAAGT
TTTCTTGTAGGCGTCAGAG,TCTTGTAGGCATCAGAG,CTCTGACGCCTACAAGAAA,CTCTGATGCCTACAAGA
CTCCTCAGGTGGGC,CTCCTCAAGTGGGC,GCCCACCTGAGGAG,GCCCACTTGAGGAG
CGTCGCATTCAGC,CGTCGCGTTCAGC,GCTGAATGCGACG,GCTGAACGCGACG
CATGAGGCGTTCGGC,ATGAGGCATTCGGC,GCCGAACGCCTCATG,GCCGAATGCCTCAT
CAGGATAATAACAAACAAG,CAGGATAATAACGAACAAG,CTTGTTTGTTATTATCCTG,CTTGTTCGTTATTATCCTG
ATCGACCCTGTCATTAG,CGACCCTGTGATTAG,CTAATGACAGGGTCGAT,CTAATCACAGGGTCG
GCACCACTGGACCC,GCACCATTGGACCC,GGGTCCAGTGGTGC,GGGTCCAATGGTGC
GAGGCCCCAGATTC,GAGGCCACAGATTC,GAATCTGGGGCCTC,GAATCTGTGGCCTC
CTACGTAATGAACGTTAGCT,ACGTAATGAACATTAGCT,AGCTAACGTTCATTACGTAG,AGCTAATGTTCATTACGT
GAACTCGTCGTTGG,GAACTCATCGTTGG,CCAACGACGAGTTC,CCAACGATGAGTTC
TCACATCCAACTCAGTACT,CATCCAACGCAGTACT,AGTACTGAGTTGGATGTGA,AGTACTGCGTTGGATG
ATAGGAGAATTGGA,ATAGGATAATTGGA,TCCAATTCTCCTAT,TCCAATTATCCTAT
TACATATGACTAATGAAA,TACATACGACTAATGAAA,TTTCATTAGTCATATGTA,TTTCATTAGTCGTATGTA
TCTATGGTGTGATTCATT,TTCTATGGTGTAATTCATT,AATGAATCACACCATAGA,AATGAATTACACCATAGAA
CTGGAAGAAGGCCTC,CTGGAAAAAGGCCTC,GAGGCCTTCTTCCAG,GAGGCCTTTTTCCAG
TTTTTGTTCAAAAG,TTTTTGGTCAAAAG,CTTTTGAACAAAAA,CTTTTGACCAAAAA
TTTGCCAAAGAGTTCAGATAC,TTTGCCAAAGTGTTCAGATAC,GTATCTGAACTCTTTGGCAAA,GTATCTGAACACTTTGGCAAA
CTACCTACCTTAGTGCTC,CTACCTACCTCAGTGCTC,GAGCACTAAGGTAGGTAG,GAGCACTGAGGTAGGTAG
CAATGAAGTTAATTTAATTGG,CAATGAAGTTCATTTAATTGG,CCAATTAAATTAACTTCATTG,CCAATTAAATGAACTTCATTG
CATTTAAAATGGTAAAAATCA,CATTTAAAATTGTAAAAATCA,TGATTTTTACCATTTTAAATG,TGATTTTTACAATTTTAAATG
AGAGTTGAATGGC,AGTGTTGAATGGC,GCCATTCAACTCT,GCCATTCAACACT
GAAGGCCAAATAAAATTG,GAAGGCCGAATAAAATTG,CAATTTTATTTGGCCTTC,CAATTTTATTCGGCCTTC
ATTGCATACTCGAGTCATCCA,ATTGCATACTTGAGTCATCCA,TGGATGACTCGAGTATGCAAT,TGGATGACTCAAGTATGCAAT
TGAGTTTTCAAGGGGTT,TGAGTTTTCAGGGGGTT,AACCCCTTGAAAACTCA,AACCCCCTGAAAACTCA
GCTAGCAAACGTCGCCA,GCTAGCAAACATCGCCA,TGGCGACGTTTGCTAGC,TGGCGATGTTTGCTAGC
GGTGGAGGGAAAAAGCAGTG,GGTGGAGGGGAAAAGCAGTG,CACTGCTTTTTCCCTCCACC,CACTGCTTTTCCCCTCCACC
TGGTCTACTTTGTGC,TGGTCTACTTCGTGC,GCACAAAGTAGACCA,GCACGAAGTAGACCA
CAGGTTGTTGGTTGTT,CAGGTTGTTGTTTGTT,AACAACCAACAACCTG,AACAAACAACAACCTG
TCCCCACCAAAATTAAGCAAA,TCCCCACCAAGATTAAGCAAA,TTTGCTTAATTTTGGTGGGGA,TTTGCTTAATCTTGGTGGGGA
CCCTGGAGATCT,CTCTGGAGATCT,AGATCTCCAGGG,AGATCTCCAGAG
ATGTTACATGTA,ACGTTACATGTA,TACATGTAACAT,TACATGTAACGT
TTTTTGTGTCCGCCATGAATT,TTTTTGTGTCTGCCATGAATT,AATTCATGGCGGACACAAAAA,AATTCATGGCAGACACAAAAA
CAAAAGTCTGTATTTTCAAAA,CAAAAGTCTGCATTTTCAAAA,TTTTGAAAATACAGACTTTTG,TTTTGAAAATGCAGACTTTTG
ACACACACAAGAGACACCCAC,ACACACACAAAAGACACCCAC,GTGGGTGTCTCTTGTGTGTGT,GTGGGTGTCTTTTGTGTGTGT
AACATATGAGTTGTAATGCCC,AACATATGAGATGTAATGCCC,GGGCATTACAACTCATATGTT,GGGCATTACATCTCATATGTT
AAATAAACGCTGGGTCTAATT,AAATAAACGCCGGGTCTAATT,AATTAGACCCAGCGTTTATTT,AATTAGACCCGGCGTTTATTT
TCCCTTGTCTATGGTATATCT,TCCCTTGTCTCTGGTATATCT,AGATATACCATAGACAAGGGA,AGATATACCAGAGACAAGGGA
TAGCCTTAAGCGCTTCCTGCC,TAGCCTTAAGAGCTTCCTGCC,GGCAGGAAGCGCTTAAGGCTA,GGCAGGAAGCTCTTAAGGCTA
CTCTCTGCTTGCGTT,CTCTCTGCTTTCGTT,AACGCAAGCAGAGAG,AACGAAAGCAGAGAG
GCTATTAAAAGG,GTTATTAAAAGG,CCTTTTAATAGC,CCTTTTAATAAC
CCATTATCATTAT,CCCTTATCATTAT,ATAATGATAATGG,ATAATGATAAGGG
TCAAGTGTTTCCTTTATTTTG,TCAAGTGTTTACTTTATTTTG,CAAAATAAAGGAAACACTTGA,CAAAATAAAGTAAACACTTGA
GCCTGACTGGACAACCATTTG,GCCTGACTGGGCAACCATTTG,CAAATGGTTGTCCAGTCAGGC,CAAATGGTTGCCCAGTCAGGC
CCTTTGTCACCGCTCATCAGC,CCTTTGTCACTGCTCATCAGC,GCTGATGAGCGGTGACAAAGG,GCTGATGAGCAGTGACAAAGG
AATGCCATTTTGT,AAAGCCATTTTGT,ACAAAATGGCATT,ACAAAATGGCTTT
GGGCCTTCGGGGTGCCTGTCC,GGGCCTTCGGTGTGCCTGTCC,GGACAGGCACCCCGAAGGCCC,GGACAGGCACACCGAAGGCCC
CATGTCAGTGC,CATGTCACTGC,GCACTGACATG,GCAGTGACATG
TAACTTACAGTC,TAACTTACAGCC,GACTGTAAGTTA,GGCTGTAAGTTA
GGGAGAGGAGGCCTGTCTTTA,GGGAGAGGAGACCTGTCTTTA,TAAAGACAGGCCTCCTCTCCC,TAAAGACAGGTCTCCTCTCCC
TGTGTCTGAGA,TGTGTCCGAGA,TCTCAGACACA,TCTCGGACACA
GAAAACTCTGCCCTG,GAAAACTCTGTCCTG,CAGGGCAGAGTTTTC,CAGGACAGAGTTTTC
CCATATGTCGCTTGT,CCATATGTCGTTTGT,ACAAGCGACATATGG,ACAAACGACATATGG
CTGGCGGGGTCTGGG,CTGGCGGGGTATGGG,CCCAGACCCCGCCAG,CCCATACCCCGCCAG
GGAGTCAGATAC,GAAGTCAGATAC,GTATCTGACTCC,GTATCTGACTTC
TGCAAGTCCTTCAAAGGCTCA,TGCAAGTCCTCCAAAGGCTCA,TGAGCCTTTGAAGGACTTGCA,TGAGCCTTTGGAGGACTTGCA
CCAGTGAGATGCTGTGTTGCA,CCAGTGAGATACTGTGTTGCA,TGCAACACAGCATCTCACTGG,TGCAACACAGTATCTCACTGG
ACTGAAGGAATTTAAC,ACTGAAGGAAGTTAAC,GTTAAATTCCTTCAGT,GTTAACTTCCTTCAGT
TACAGTTTCCTGTCTGA,TACAGTTTCCAGTCTGA,TCAGACAGGAAACTGTA,TCAGACTGGAAACTGTA
AACGTGACACAAT,AACGTGACACGAT,ATTGTGTCACGTT,ATCGTGTCACGTT
CAACATTCCAGTCTGAAAC,CATTCCAGCCTGAAAC,GTTTCAGACTGGAATGTTG,GTTTCAGGCTGGAATG
GTGAACCAATCAAT,GTGAGCCAATCAAT,GTGAACTAATCAAT,GTGAGCTAATCAAT,ATTGATTGGTTCAC,ATTGATTGGCTCAC,ATTGATTAGTTCAC,ATTGATTAGCTCAC
GTCAAACCAACTTTGCCAAGG,GTCAAACCAATTTTGCCAAGG,CCTTGGCAAAGTTGGTTTGAC,CCTTGGCAAAATTGGTTTGAC
TCTCTAAAAAGGTACAGTATA,TCTCTAAAAAAGTACAGTATA,TATACTGTACCTTTTTAGAGA,TATACTGTACTTTTTTAGAGA
GTGACAAGGTAGGGGTTG,GTGACATGGTAGGGGTTG,CAACCCCTACCTTGTCAC,CAACCCCTACCATGTCAC
CACGGTTTACACTCCTATTA,ACGGTTTACACTCCAATTA,TAATAGGAGTGTAAACCGTG,TAATTGGAGTGTAAACCGT
CATCAACACAATCTGC,CATCAACACGATCTGC,GCAGATTGTGTTGATG,GCAGATCGTGTTGATG
AGGGTCTCATGCTCCCT,AGGGTCTCGTGCTCCCT,AGGGAGCATGAGACCCT,AGGGAGCACGAGACCCT
TCACAAATGTATCCTAAAGC,CACAAATGTATACTAAAGC,GCTTTAGGATACATTTGTGA,GCTTTAGTATACATTTGTG
CCAACGGCGACTTG,CCAACGCCGACTTG,CAAGTCGCCGTTGG,CAAGTCGGCGTTGG
TGCATGTAACAAATAACAT,TGCATGTAACATAACAT,ATGTTATTTGTTACATGCA,ATGTTATGTTACATGCA
GTCTCTGACGGTGTGCTTTC,GTCTCTGACTGTGTGCTTTC,GAAAGCACACCGTCAGAGAC,GAAAGCACACAGTCAGAGAC
GTACGGAAAAAACA,GTACGGGAAAAACA,TGTTTTTTCCGTAC,TGTTTTTCCCGTAC
GGTTACATCCCAAA,GGTTACACCCCAAA,GGTTACGTCCCAAA,GGTTACGCCCCAAA,TTTGGGATGTAACC,TTTGGGGTGTAACC,TTTGGGACGTAACC,TTTGGGGCGTAACC
TCGAACTCCGCTCCTAG,TCGAACTCCACTCCTAG,CTAGGAGCGGAGTTCGA,CTAGGAGTGGAGTTCGA
CCACTAAGGATTACGTTACG,CACTAAGGATTACATTACG,CGTAACGTAATCCTTAGTGG,CGTAATGTAATCCTTAGTG
TACAGATGTCATTTTAC,CTACAGATGTAATTTTAC,GTAAAATGACATCTGTA,GTAAAATTACATCTGTAG
TGGAATGGGTAAGGTGTA,TGGAATGTGTAAGGTGTA,TACACCTTACCCATTCCA,TACACCTTACACATTCCA
TCCTTCTCACGCTTCT,CTCCTTCTCATGCTTCT,AGAAGCGTGAGAAGGA,AGAAGCATGAGAAGGAG
CCCGCGGTGAGTAT,CCCGCTGTGAGTAT,ATACTCACCGCGGG,ATACTCACAGCGGG
CCTCCTGGGTATATCG,CTCCTGGGCATATCG,CGATATACCCAGGAGG,CGATATGCCCAGGAG
CATCTGGCAATGCCTT,CATCTGGCAGTGCCTT,AAGGCATTGCCAGATG,AAGGCACTGCCAGATG
GCATTTTAAAAATC,GCATTTAAAAAATC,GATTTTTAAAATGC,GATTTTTTAAATGC
CCGTGGTATTGTTTCAA,CCGTGGTATTCTTTCAA,TTGAAACAATACCACGG,TTGAAAGAATACCACGG
TGTATGACCTCTGACCTGT,TGTATGACCTCTAACCTGT,ACAGGTCAGAGGTCATACA,ACAGGTTAGAGGTCATACA
TCTCTGCTCATCTGTC,CTCTGCTCGTCTGTC,GACAGATGAGCAGAGA,GACAGACGAGCAGAG
ATCAAGCTGACGAACCA,CAAGCTGACAAACCA,TGGTTCGTCAGCTTGAT,TGGTTTGTCAGCTTG
TGACTCTCAGCATCTG,TGACTCTCAGCAACTG,TGACTCTCTGCATCTG,TGACTCTCTGCAACTG,CAGATGCTGAGAGTCA,CAGTTGCTGAGAGTCA,CAGATGCAGAGAGTCA,CAGTTGCAGAGAGTCA
AGCTCCATGCGGACT,AGCTCCACGCGGACT,AGTCCGCATGGAGCT,AGTCCGCGTGGAGCT
CTGGACCAGAACTCTGA,CTGGACCAGATCTCTGA,TCAGAGTTCTGGTCCAG,TCAGAGATCTGGTCCAG
AAGGTGCCTTCCCC,AAAGTGCCTTCCCC,AAGGTGGCTTCCCC,AAAGTGGCTTCCCC,GGGGAAGGCACCTT,GGGGAAGGCACTTT,GGGGAAGCCACCTT,GGGGAAGCCACTTT
CTCTTCGATGTCTAGACA,CTCTTCAATGTCTAGACA,TGTCTAGACATCGAAGAG,TGTCTAGACATTGAAGAG
GCACGATGCAGAAC,GCACGATGTAGAAC,GCACGAAGCAGAAC,GCACGAAGTAGAAC,GTTCTGCATCGTGC,GTTCTACATCGTGC,GTTCTGCTTCGTGC,GTTCTACTTCGTGC
GAGAGCCGAGCTTT,GAGAGCTGAGCTTT,AAAGCTCGGCTCTC,AAAGCTCAGCTCTC
CCACGTAGCGATCG,ACCACATAGCGATCG,CGATCGCTACGTGG,CGATCGCTATGTGGT
AAGTCAGCATCTTTCA,AGTCAGCGTCTTTCA,TGAAAGATGCTGACTT,TGAAAGACGCTGACT
ATGGAGGATTGTGGTTGT,ATGGAGGATTCTGGTTGT,ACAACCACAATCCTCCAT,ACAACCAGAATCCTCCAT
AAGAAGCATTTTTTGG,AAGAAGCATTTTTTGT,AAGAAGCATTTTTTTG,AAGAAGCATTTTTTTT,AAGAAGCATTTATTTT,CCAAAAAATGCTTCTT,ACAAAAAATGCTTCTT,CAAAAAAATGCTTCTT,AAAAAAAATGCTTCTT,AAAATAAATGCTTCTT
TATTGGTCAGGGAA,TATTGGCCAGGGAA,TTCCCTGACCAATA,TTCCCTGGCCAATA
CTCCCACAAACCC,TCCCAGAAACCC,GGGTTTGTGGGAG,GGGTTTCTGGGA
TCCGTTAGTTCATCCTGG,TCCGTTAGTTCCTCCTGG,CCAGGATGAACTAACGGA,CCAGGAGGAACTAACGGA
ACAGATCCATCCACCACT,AGATCCAGCCACCACT,AGTGGTGGATGGATCTGT,AGTGGTGGCTGGATCT
CTGGACGCCGTTACA,TGGACGCCATTACA,TGTAACGGCGTCCAG,TGTAATGGCGTCCA
AGTGCGAAGAACC,AGTGCAAAGAACC,GGTTCTTCGCACT,GGTTCTTTGCACT
AGCAATCCCACAGC,AGCAATCACACAGC,AGAAATCCCACAGC,AGAAATCACACAGC,AGCAATTCCACAGC,AGCAATTACACAGC,AGAAATTCCACAGC,AGAAATTACACAGC,GCTGTGGGATTGCT,GCTGTGTGATTGCT,GCTGTGGGATTTCT,GCTGTGTGATTTCT,GCTGTGGAATTGCT,GCTGTGTAATTGCT,GCTGTGGAATTTCT,GCTGTGTAATTTCT
ATGGCCCTTACACTATC,TGGCCCTTACGCTATC,GATAGTGTAAGGGCCAT,GATAGCGTAAGGGCCA
ACAGAGAGAAGTCCCAGGTG,AGAGAGAAGCCCCAGGTG,CACCTGGGACTTCTCTCTGT,CACCTGGGGCTTCTCTCT
CCAGATGAACAACTTCAC,CCAGATGAGCAACTTCAC,GTGAAGTTGTTCATCTGG,GTGAAGTTGCTCATCTGG
CAGCTGTCCAGTTCTG,CAGTTGTCCAGTTCTG,CAGAACTGGACAGCTG,CAGAACTGGACAACTG
CTAACCCGGACGAC,CTAACCCGGACAAC,CTAACCCGGAAGAC,CTAACCCGGAAAAC,CTAACCTGGACGAC,CTAACCTGGACAAC,CTAACCTGGAAGAC,CTAACCTGGAAAAC,GTCGTCCGGGTTAG,GTTGTCCGGGTTAG,GTCTTCCGGGTTAG,GTTTTCCGGGTTAG,GTCGTCCAGGTTAG,GTTGTCCAGGTTAG,GTCTTCCAGGTTAG,GTTTTCCAGGTTAG
CTGGGTCGGCGCT,TGGGTCGACGCTC,AGCGCCGACCCAG,GAGCGTCGACCCA
TAGTAGCCCCTACACCTC,TAGCCCCTGCACCTC,GAGGTGTAGGGGCTACTA,GAGGTGCAGGGGCTA
CTGGCTGTAAACGAAGA,TGGCTGTAAACAAAGA,TCTTCGTTTACAGCCAG,TCTTTGTTTACAGCCA
TAGCCGTCACCGAT,TAGCCGCCACCGAT,ATCGGTGACGGCTA,ATCGGTGGCGGCTA
CATCCTGCTGGACCC,CATCCTGTTGGACCC,GGGTCCAGCAGGATG,GGGTCCAACAGGATG
TGGAGGTGGAGGAG,TGGAGGGGGAGGAG,CTCCTCCACCTCCA,CTCCTCCCCCTCCA
CGACACCACTTACA,CGACACTACTTACA,TGTAAGTGGTGTCG,TGTAAGTAGTGTCG
CCTTCCCTCCTAGGGCAACGT,CCTTCCCTCCCAGGGCAACGT,ACGTTGCCCTAGGAGGGAAGG,ACGTTGCCCTGGGAGGGAAGG
CCTATGAAGTT,CCGATGAAGTT,AACTTCATAGG,AACTTCATCGG
TTCACGTACGGCCCAT,TTCACATACGGCCCAT,ATGGGCCGTACGTGAA,ATGGGCCGTATGTGAA
ACCCATGAATAAGGACGAGAG,ACCCATGAATGAGGACGAGAG,CTCTCGTCCTTATTCATGGGT,CTCTCGTCCTCATTCATGGGT
CATCTTAGCCTCTCTGACCCC,CATCTTAGCCCCTCTGACCCC,GGGGTCAGAGAGGCTAAGATG,GGGGTCAGAGGGGCTAAGATG
CCTGAGATTAGG,CCTGAGATTAAG,CCTAATCTCAGG,CTTAATCTCAGG
TGATCATATCTCGTTCAGT,TGATCATACCTCGTTCAGT,ACTGAACGAGATATGATCA,ACTGAACGAGGTATGATCA
TCATTTTTGCAGAGAGAGAAT,TCATTTTTGCGGAGAGAGAAT,ATTCTCTCTCTGCAAAAATGA,ATTCTCTCTCCGCAAAAATGA
AGCCAATTGTAGCCTTAGTGC,AGCCAATTGTTGCCTTAGTGC,GCACTAAGGCTACAATTGGCT,GCACTAAGGCAACAATTGGCT
GTTGGGAGCGTCCCAAAATGG,GTTGGGAGCGGCCCAAAATGG,CCATTTTGGGACGCTCCCAAC,CCATTTTGGGCCGCTCCCAAC
AGCCTCTTCCTCTCTG,AGCCTCTTCCCCTCTG,CAGAGAGGAAGAGGCT,CAGAGGGGAAGAGGCT
GACCTCAAGCAGTCAG,GACCTTAAGCAGTCAG,CTGACTGCTTGAGGTC,CTGACTGCTTAAGGTC
AGATGAACACCAACTGGCCGG,AGATGAACACTAACTGGCCGG,CCGGCCAGTTGGTGTTCATCT,CCGGCCAGTTAGTGTTCATCT
CCTGCACACATGTCAAACCG,CCTGCACACGTGTCAAACCG,CGGTTTGACATGTGTGCAGG,CGGTTTGACACGTGTGCAGG
GTGTGAAAGGGGAGAAGGGCT,GTGTGAAAGGAGAGAAGGGCT,AGCCCTTCTCCCCTTTCACAC,AGCCCTTCTCTCCTTTCACAC
AGTCTGTCGTTGT,AGGCTGTCGTTGT,ACAACGACAGACT,ACAACGACAGCCT
GCAAATCTCCGATGTAAAGT,GCAAATCTCTGATGTAAAGT,ACTTTACATCGGAGATTTGC,ACTTTACATCAGAGATTTGC
TATTCAAAAGGAGCAGTTCAT,TATTCAAAAGAAGCAGTTCAT,ATGAACTGCTCCTTTTGAATA,ATGAACTGCTTCTTTTGAATA"
for i in $PROBES
do
cat ./merged_Ots351_hash.txt | /dfs/Omalley_Lab/dayan/software/seqkit grep -s -i -p $i | grep ">" | awk -F ";" '{sum += $3} END {print sum}' >> Ots351_probe_cts.txt
done
After running these scripts combine the results in a text editor with marker names.
counts_351 <- read_tsv("results/primer_probe_counts_351.txt")
#lets get rid of the sex marker
counts_351 %<>%
filter(marker != "Ots_SEXY3-1")
# for this step it is best to look at reads containing the primer alone, as a read-hog primer can waste reads even if it amplifies off-target regions.
ggplot(data = counts_351)+geom_histogram(aes(x = primer_count))+theme_classic()
Looks like there are a handful of read hogs in there, let’s examine those first.
Here we take an arbitrary cutoff (drawn from the read distribution above) of greater than 1,800,000 reads
counts_351$portion_primer_reads <- counts_351$primer_count/sum(counts_351$primer_count)
kable(arrange(counts_351[counts_351$primer_count>18e5,], desc(primer_count)), digits = 3 )
| marker | primer_count | probe_count | portion_primer_reads |
|---|---|---|---|
| Ots_u07-57.120 | 7406103 | 44470 | 0.084 |
| Ots17_1488679_C6 | 3623890 | 63705 | 0.041 |
| Ots_103122-180 | 2690052 | 37631 | 0.030 |
| Ots17_1345774_C6 | 2622334 | 252860 | 0.030 |
| Ots_myoD-364 | 2126609 | 91817 | 0.024 |
| Ots_hsp27b-150 | 1848744 | 75133 | 0.021 |
#sum(counts_351[counts_351$primer_count>18e5,4])
#sum(counts_351[counts_351$primer_count>5e5,4])
We see here that the top six most observed primers (of 350) account for 23% of total reads.
All but one of these demonstrate highly skewed primer/probe ratios, indicative of poor specificity for these primers. These are great candidates for removal or a change to the primer sequence to improve specificity.
Failure to amplify
While some low amplifying primers can be “rescued” by excluding read hogs from future runs, primers with zero or extremely low amplification may need to be redesigned, or the in silico sequence used may need to modified to accomodate unaccounted for variants in the primer region. Such markers might show low/zero primer counts, but large numbers of probe counts. Let’s look at these.
Below we plot the read distribution overall (blue) (estimated by primer counts), vs reads from markers with less than 10% of the median number of reads.
zero_primers <- counts_351 %>%
filter(primer_count < 0.1* median(primer_count))
ggplot(data = zero_primers)+geom_histogram(aes(x = probe_count), fill = "red", alpha = 0.7)+geom_histogram(data = counts_351, aes(x = probe_count), fill = "blue", alpha = 0.5)+ theme_classic()
1 marker (Ots_108735-302) has very poor amplification (just 11k reads, when median is 120261). It also has low probe counts, suggesting this isn’t a caused by a variant in the primer.
Let’s check anyway. Below see if there are any reads in there that are very close to this primer sequence suggesting that marker dropout is due to probe region variation in new population.
#mismatches
/local/cluster/bbmap/bbduk.sh -Xmx33M in=merged_Ots351.fasta outm=Ots_108735-302.mm.fastq literal=CCTTTTTCTTATTAGTTTTACTTCCCCAGAGA k=31 edist=2
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_108735-302.mm.fastq outm=Ots_108735-302.mm.probe.fastq literal=AAACAAACAACGCCTCATG,AACAAACAACACCTCATG k=19 edist=2
cat Ots_108735-302.mm.probe.fastq | sed -n '2~4p' | sort | uniq -c | sort -n -r
No, there only reads that come out of this search are reads that are an exact match to primer and probe sequences.
Failure to find probe
zero_probes <- counts_351 %>%
filter(probe_count < 0.1* median(probe_count))
ggplot(data = zero_probes)+geom_histogram(aes(x = primer_count), fill = "red", alpha = 0.7)+geom_histogram(data = counts_351, aes(x = primer_count), fill = "blue", alpha = 0.5)+ theme_classic()
Also two markers with extremely low probe counts. One is already examined (Ots_108735-302), however another is Ots_wenYhap_33126. While this marker also shows low primer counts, the primer:probe ratio is extremely skewed (~60:1). Let’s look for variants with this marker as well
#mismatches
/local/cluster/bbmap/bbduk.sh -Xmx33M in=merged_Ots351.fasta outm=Ots_wenYhap_33126.fastq literal=GAAGAGTCATGTCAGGATTAGCC k=23 edist=2
#62k reads
#no mismatches
/local/cluster/bbmap/bbduk.sh -Xmx33M in=merged_Ots351.fasta outm=Ots_wenYhap_33126.fastq literal=GAAGAGTCATGTCAGGATTAGCC k=23 edist=0
#59k reads
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_wenYhap_33126.fastq outm=Ots_wenYhap_33126.mm.probe.fastq literal=GTAAAGCAAA,GTAACGCAAA k=10 edist=2
#233 reads
cat Ots_wenYhap_33126.mm.probe.fastq | sed -n '2~4p' | sort | uniq -c | sort -n -r
#
Nope, couldn’t find a null allele with this method.
First let’s look at the distribution of primer and probe counts alone.
Primers first (ignoring the read hogs).
# for this step it is best to look at reads containing the primer alone, as a read-hog primer can waste reads even if it amplifies off-target regions.
ggplot(data = counts_351)+geom_histogram(aes(x = primer_count))+theme_classic()+xlim(-1000, 1800000)
Probes next.
ggplot(data = counts_351)+geom_histogram(aes(x = probe_count),)+theme_classic()
ggplot(data = counts_351)+geom_histogram(aes(x = probe_count),)+theme_classic()+xlim(-1000, 500000)+ggtitle("probe count\ntruncated at 500k")
Ratios
Now let’s take a look at the ratio of those values.
# for this step it is best to look at reads containing the primer alone, as a read-hog primer can waste reads even if it amplifies off-target regions.
ggplot(data = counts_351)+geom_histogram(aes(x = log10( primer_count/probe_count)))+theme_classic()+geom_vline(aes(xintercept = 0.69897), color = "blue")+geom_vline(aes(xintercept = -0.69897), color = "red")
Most primers and probes have sufficient specificity (i.e. primer:Probe ratio between 1:5 and 5:1). Most primer:probe ratios are skewed towards MORE primers than probes. Only a handful are strongly skewed (ratio greater than 5x, 10 markers) and one is strongly skewed in the opposite direction (less than 1/5). Positive skewed primer:probe ratios can be due to several effects:
Let’s look more closely by seeing if this is skewed towards high amplifiers.
ggplot(data = counts_351)+geom_point(aes((probe_count), (primer_count)), alpha = 0.5)+geom_abline(aes(slope = 1, intercept = 0))+xlim(4,7)+ylim(4,7)+scale_x_log10()+scale_y_log10()+theme_classic()
No! It looks like the high primer:probe ratio markers (i.e. strong positive residuals from the 1:1 line in the plot above) occur across the distrubution of primer counts, suggesting these primers are could be either amplifying off-target sites or that there are bad probes/probe region variation. This warrants looking into more closely, but there’s not much more we can do with primer and probe counts alone.
high_ratio_markers <- counts_351[which(counts_351$primer_count/counts_351$probe_count >5),1]
low_ratio_markers <- counts_351[which(counts_351$primer_count/counts_351$probe_count < 0.2),1]
Here we run through the normal gtseq genotyping pipeline (w/o filtering) to call genotypes. This allows us to accomplish several QC steps:
Paralogs: Identify potential paralogs, adjust allele correction values or change in silico probe length to increase specificity to avoid paralogs hits.
Probe/Primer Variants: Similarly, we can look for loci with non-normal distributions of reads across individuals to attempt to identify if primer or probe sequences occur over common variants. We can “rescue” these markers by shortening the in-silico primer or probe sequences to avoid the variant, but then this new in silico primer/probe needs to be tested again for specificity.
note used the allele correction values shared in the OtsGtseqv6.1 panel info shared from CRITFC
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-195
#$ -tc 20
#$ -N GTseq-genotyperv3
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err
export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/'
FASTQS=(`ls -1 ../../raw_reads/Ots351/*fastq`)
INFILE=${FASTQS[$SGE_TASK_ID -1]}
OUTFILE=$(basename ${INFILE%.fastq}.genos)
GTSEQ_GENO="/dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_Genotyper_v3.1.pl
"
PROBE_SEQS="/dfs/Omalley_Lab/dayan/coastal_chinook_2020/75bp_panel_qc/genotyping/Ots351/Ots351_probeseqs.csv"
perl $GTSEQ_GENO $PROBE_SEQS $INFILE > $OUTFILE
#save this code chunk as a file on the server and submit this with qsub -q harold scriptname from the directory you want the output .genos files
SGE_Batch -q harold -r compile -c 'perl /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_GenoCompile_v3.pl > Ots351_coastal_chinook_run018_genos_0.1.csv'
#also collect marker info
touch marker_info.txt
for file in ./*genos
do
awk ' FS="," {print FILENAME,$1,$2,$3,$6,$7,$8}' $file >> marker_info.txt
done
Normally we would do some filtering at this step, but we want to look at ALL loci, so moving forward with no filtering.
gt <- read_csv("results/Ots351_coastal_chinook_run018_genos_0.1.csv")
marker_info <- read_tsv("results/marker_info.txt")
marker_info$a1_count <- as.numeric(marker_info$a1_count)
marker_info$a2_count <- as.numeric(marker_info$a2_count)
marker_info$total_count <- marker_info$a1_count+marker_info$a2_count
marker_info %>%
summarise(mean=mean(total_count, na.rm = TRUE), median=median(total_count, na.rm = TRUE), sd=sd(total_count, na.rm = TRUE))
gt %<>%
filter(str_detect(Sample, "negative") == FALSE) %>%
filter(str_detect(Sample, "positive") == FALSE)
ggplot(data = marker_info)+geom_density(aes(x = (total_count)), fill = "#00204DFF", alpha = 0.5)+theme_classic()+geom_vline(aes(xintercept = median(marker_info$total_count, na.rm = TRUE)))+xlab("depth per locus per individual\nmedian depth = 418 (vertical line)")
ggplot(data = marker_info)+geom_density(aes(x = (total_count)), fill = "#00204DFF", alpha = 0.5)+theme_classic()+geom_vline(aes(xintercept = median(marker_info$total_count, na.rm = TRUE)))+xlab("depth per locus per individual\nmedian depth = 418 (vertical line)")+xlim(0,2000)+ggtitle("truncated at depth of 2k per individual per locus")
## Warning: Removed 6648 rows containing non-finite values (stat_density).
sum(gt$`On-Target Reads`)/sum(gt$`Raw Reads`)
## [1] 0.4671829
Depth looks great to make GT calls. We’re also seeing the effect of some read hogs here (long right tail) and some individual genotypes with very low depth. This may be due to allele dropout or marker dropout caused by unaccounted for variation, which will be revealed soon. 46% of reads assigned to individuals were on-target.
Here we plot allele counts for all loci and examine if any markers have a skewed allele count ratio for heterozygotes
library(lme4)
hets <- filter(marker_info, called_geno == "HET" | is.na(called_geno))
models <- hets %>%
group_by(marker) %>%
group_map(~ lm(a1_count ~ a2_count, data= .))
# Apply coef to each model and return a list of allele count ratios
lms <- lapply(models, coef)
ggplot()+geom_histogram(aes(x = sapply(lms,`[`,2)))+theme_classic()+ggtitle("allele ratios for all NA and HET calls")
#list of p-values
lms_anova <- lapply(models, summary)
# collect info about each bad model
paralog_possible <- which(abs(sapply(lms,`[`,2)) > 1.5) #bad because a positively skewed allele ratio
paralog_possible2 <- which(abs(sapply(lms,`[`,2)) < (2/3)) # bad because a negative skewed allele ratio
paralog_possible3 <- which(sapply(lms_anova, function(x) x$coefficients[,4][2])> 0.01) # bad because too much variance in allele ratio, even if mean ratio is 1
paralog_possible <- c(paralog_possible, paralog_possible2, paralog_possible3)
16 markers have skewed allele ratios (greater than 1.5), or high variance in allele ratios at heterozygote GTs, indicative of a possible paralogs.
Let’s take a look at these.
plots <- marker_info %>%
group_by(marker) %>%
do(plots=ggplot(data=.)+geom_point(aes(a1_count, a2_count, color = called_geno))+theme_classic()+geom_abline(aes(slope=1, intercept=0))+geom_abline(aes(slope = 10, intercept=0), color = "green")+geom_abline(aes(slope = 0.1, intercept=0), color = "red")+geom_abline(aes(slope = 0.2, intercept=0), color = "blue")+geom_abline(aes(slope = 5, intercept=0), color = "blue")+coord_equal(ratio=1)+geom_abline(slope = -1, intercept = 10)+ggtitle(unique(.$marker)))
# we did this manually by iterating the command plots$plots[[paralog_possible[X]]] for x = 1:length(paralog_possible) and flagging any problematic markers, which we examine in detail later
After examining, 11 markers were likely tagging paralogues or need major adjustments to the allele correction values. These are plotted below
a <- plots$plots[[paralog_possible[1]]]
b <- plots$plots[[paralog_possible[2]]]
c <- plots$plots[[paralog_possible[4]]]
d <- plots$plots[[paralog_possible[5]]]
e <- plots$plots[[paralog_possible[6]]]
f <- plots$plots[[paralog_possible[7]]]
g <- plots$plots[[paralog_possible[10]]]
h <- plots$plots[[paralog_possible[11]]]
i <- plots$plots[[paralog_possible[14]]]
j <- plots$plots[[paralog_possible[15]]]
k <- plots$plots[[paralog_possible[16]]]
cowplot::plot_grid(a,b,c,d,e,f,g,h,i,j,k, ncol = 2 )
rm(a,b,c,d,e,f,g,h,i,j,k)
For a few of these markers, it appears that changing the allele correction values with the primer/rpobe sequences as-is would solve our problems (see Ots_104063-132), but for others there are multiple allele ratios among hets (see Ots_110689-218). This can be due to unaccounted for variation in either the primer or probe sequence, allowing some true heterozygotes to draw reads from a single target region, while other draw from both the target region and a paralogous region.
Ideally we could change the primer or probe sequences (either the actual sequence or the in silico sequence) to account for this variation, but that would require changing the information used to call genotypes across labs.
zero/low reads Another QC check from the genotyped data is whether there are any individual GTs that have zero/very low reads when the marker and individual are otherwise good. This indicates that variation in either the primer or probe region of the amplicon is causing allele dropout or total marker dropout. This can be run using GTseq_GenoCompile_Counts.pl, but let’s just do it in R (below)
The rationale of the work below is to depth per marker per individual across both markers and individuals. If a marker has an unusally large number of individuals with low depth (once normalized for the effect of individual and marker depth) this suggests that there is a null allele (i.e. a variant falling in the region covered by the primer and/or probe sequence)
marker_info$total_count <- marker_info$a1_count+marker_info$a2_count
#lets get rid of the negative and positive controls for this step
marker_info %<>%
filter(str_detect(ind, "negative") == FALSE) %>%
filter(str_detect(ind, "positive") == FALSE)
geno_counts <- marker_info %>%
select(total_count, marker, ind) %>%
pivot_wider(id_cols = ind, names_from = marker, values_from = total_count)
# now lets normalize the read counts by the total depth at each individual to better identify locus level outliers
ind_depth <- rowSums(select(geno_counts, -ind))
normalized_geno_counts <- geno_counts %>%
select(-ind) %>%
mutate_all(~ ./ind_depth)*100
#hist(unlist(normalized_geno_counts[,-1]), breaks = 500, main = "normalized % total reads per locus per individual")
#hist(unlist(normalized_geno_counts[,-1]), xlim = c(0,1), breaks = 500, main = "normalized % total reads per locus per individual\nreadhogs excluded")
# but these zeros/low scores could also be from bad markers across ALL samples, let's normalize across columns too
normalized2 <- as.data.frame(cbind(normalized_geno_counts[,1], (scale(normalized_geno_counts[,-1], center=FALSE, scale=colSums(normalized_geno_counts[,-1])))))
hist(table(which(normalized2 < 0.001, arr.ind = TRUE)[,2]), main= "number of individuals within a marker\ndemonstrating less than 0.1% normalized read depth")
hist(table(which(normalized2 == 0, arr.ind = TRUE)[,2]), main= "number of individuals within a marker\ndemonstrating 0 read depth")
# okay but now lets see if this just from extremely low coverage individuals
bad_depths <- ind_depth[which(normalized2 == 0, arr.ind = TRUE)[,1]]
ggplot()+geom_histogram(aes(x = ind_depth), color = "red", fill = "red", alpha = 0.2)+geom_histogram(aes(x = bad_depths), color = "blue", fill = "blue", alpha = 0.2)+theme_classic()+ggtitle("depths of individuals contributing to zero read GTs (blue), vs all depths (red)")
bad_depths2 <- ind_depth[which(normalized2 < 0.001, arr.ind = TRUE)[,1]]
ggplot()+geom_histogram(aes(x = ind_depth), color = "red", fill = "red", alpha = 0.2)+geom_histogram(aes(x = bad_depths2), color = "blue", fill = "blue", alpha = 0.2)+theme_classic()+ggtitle("depths of individuals contributing to low read read GTs (blue), vs all depths (red)")
Yes, there are some markers here that sequence well overall, but sequence poorly in a subset of individuals that otherwise sequence well. This suggests that there is some unnaccounted for variation. Are these by chance alone? Let’s see if some particular markers are over represented, suggesting that there is indeed a variant out there. The table below show the number of individuals (out of 187) where the normalized number of reads per marker and individual is less than 0.1% of the mean.
bad_table <- as_tibble(which(normalized2 < 0.001, arr.ind = TRUE))
colnames(bad_table) <- c("ind_index", "marker_index")
bad_counts <- bad_table %>%
mutate(marker = colnames(normalized2)[marker_index]) %>%
count(marker, sort= TRUE)
bad_counts
Let’s also take a simpler approach and just count the number of individuals with zero reads. Note that this could be caused by issues other than null alleles.
geno_counts %>%
summarise_each(funs(sum(.=="0", na.rm = TRUE))) %>%
t() %>%
as.data.frame() %>%
slice_max(V1, n = 5)
Only one marker shows up here that didn’t show up earlier. Ots_wenYhap_33126. I already demonstrated that this marker has a lot of NA genotype calls because of low probe counts and couldn’t find a null allele.
Not doing any hypothesis testing here, but a few markers seem strongly overrepresented compared to chance expectation. For example Ots_MetA has 19 individuals with less than 0.1% of the normalized mean read depth.
Do any of the markers with a lot of dropout individuals also demonstrate strange paralog behavior?
plots[[1]][paralog_possible][plots[[1]][paralog_possible] %in% bad_counts$marker]
## [1] "Ots_crRAD24807-74" "Ots_MetA" "Ots_crRAD76512-28"
## [4] "Ots_IsoT" "Ots7_54212944"
Yes, 5 (above) out of the 16 markers also show some evidence of being a paralog.
Are they also enriched among those with low primer/probe specificity?
#high_ratio_markers$marker %in% bad_counts$marker
#low_ratio_markers$marker %in% bad_counts$marker
No, this suggests these markers just have poor primer specificity leading to high off-target amplification, and not allele dropout due to probe region variation. In any case let’s pull a few of the worst offender’s sequence data and look for variants.
#lets get the non-normalized geno counts back as a reference
geno_counts <- marker_info %>%
select(total_count, marker, ind) %>%
pivot_wider(id_cols = ind, names_from = marker, values_from = total_count)
#now collect individuals with zero reads at one of the markers identified as low read outliers
geno_counts[geno_counts$`Ots_105401-325`== 0, 1]
geno_counts[geno_counts$`Ots_MetA`<= 10, 1]
geno_counts[geno_counts$`Ots28_11160599`<= 15, 1]
geno_counts[geno_counts$`Ots_128302-57`<= 0, 1]
#now pull the corresponding sequence data out of these samples and check for variation
#for each questiionable marker concatenate fastqs all the individuals with allele dropout
#then count the number of reads from these individuals using exact matching using bbduk (remember to set kmer size to primer or probe length)
# now do the same process but allow for edit distance of 2, are there more reads that make it through?
############# Ots_105401-325
# list of files with zero reads at this marker
cat OtsAC20SILR_0042_CGGATG_ATGTTG_S90_R1_001.fastq OtsAC20SILR_0045_AACTTA_ACTCTT_S194_R1_001.fastq OtsAC20SILR_0093_AACTTA_AGTAGG_S242_R1_001.fastq OtsAC20SILR_0118_AACTTA_CCGAGG_S267_R1_001.fastq OtsAC20SILR_0119_AACTTA_CTGGTT_S268_R1_001.fastq OtsAC20TRAR_0028_CGGATG_CGGTCC_S28_R1_001.fastq > ../Ots_105401-325_inds.fastq
#exact matches
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_105401-325_inds.fastq outm=Ots_105401-325_inds.primer.fq literal=GAACTGAGCGGCTGCTG k=17 edist=0
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_105401-325_inds.primer.fq outm=Ots_105401-325_double_match.fq literal=CCCGGACAAGATGAGACAG,CCCGGACAAGATGAGACCG,CTGTCTCATCTTGTCCGGG,CGGTCTCATCTTGTCCGGG k=18 edist=0
# 0 reads from above
#mismatches
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_105401-325_inds.fastq outm=Ots_105401-325_inds.primer.mm.fq literal=GAACTGAGCGGCTGCTG k=17 edist=2
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_105401-325_inds.primer.mm.fq outm=Ots_105401-325_double_match.mm.fq literal=CCCGGACAAGATGAGACAG,CCCGGACAAGATGAGACCG,CTGTCTCATCTTGTCCGGG,CGGTCTCATCTTGTCCGGG k=17 edist=2
# also zero here, so there are no variants within an edit distance of two causing allele dropout in these samples, despite plenty of primer matches, no close probe matches
###################### Ots_128302-57
cat OtsAC20SILR_0001_CGGATG_AATGAG_S49_R1_001.fastq OtsAC20SILR_0033_CGGATG_ACCATG_S81_R1_001.fastq OtsAC20SILR_0076_AACTTA_AAGCTA_S225_R1_001.fastq OtsAC20SILR_0077_AACTTA_AGGGTC_S226_R1_001.fastq > ../Ots_128302-57.inds.fastq
#exact matches
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_128302-57.inds.fastq outm=Ots_128302-57.primer.fastq literal=GGTTGCAGGGCAGAACTGT k=19 edist=0
#846 hits
#search probes are different lengths, so lets do them separately
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_128302-57.primer.fastq outm=Ots_128302-57.both.fastq literal=CCTGCAATACGACCAAC k=17 edist=0
#0 hits
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_128302-57.primer.fastq outm=Ots_128302-57.both.fastq literal=CTGCAATACAACCAAC k=16 edist=0
#0 hits
# edit distance=2 (lets start from the exact matching primers this time)
#search probes are different lengths, so lets do them separately
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_128302-57.primer.fastq outm=Ots_128302-57.both.mm.fastq literal=CCTGCAATACGACCAAC k=17 edist=2
#622 hits
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_128302-57.primer.fastq outm=Ots_128302-57.both.mm.fastq literal=CTGCAATACAACCAAC k=16 edist=2
#622 hits
# okay now lets look at what we have going on here (this command gets just the sequence lines from the fastq and then sorts and counts unique sequences)
cat Ots_128302-57.both.mm.fastq | sed -n '2~4p' | sort | uniq -c | sort -n -r
# did local alignment of the probe sequences vs the most common read (571 reads) and found this
probex GTTGGTCGTATTGCAGG
probey GTTGGTTGTATTGCAG
segment of read GTTGGTCATATTGCAGG
########### Ots28_11160599
cat OtsAC20SILR_0014_CGGATG_GTATCC_S62_R1_001.fastq OtsAC20SILR_0026_CGGATG_ATCAAA_S74_R1_001.fastq OtsAC20SILR_0036_CGGATG_CTTATG_S84_R1_001.fastq OtsAC20SILR_0037_CGGATG_GCGCTG_S85_R1_001.fastq OtsAC20SILR_1001_CGGATG_GCTCAA_S93_R1_001.fastq OtsAC20TRAR_0026_CGGATG_AGCGCA_S26_R1_001.fastq OtsAC20TRAR_0034_CGGATG_AGGGTC_S34_R1_001.fastq OtsAC20TRAR_0036_CGGATG_CGTCTA_S36_R1_001.fastq OtsAC20TRAR_0042_CGGATG_AGGTGT_S42_R1_001.fastq > ../Ots28_11160599inds.fastq
#exact matches
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots28_11160599inds.fastq outm=Ots28_11160599.primer.fastq literal=GTGCATATTTTACGTGGTTGAAGT k=24 edist=0
#200 hits
#search probes
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots28_11160599.primer.fastq outm=Ots28_11160599.both.fastq literal=CTCTCTGCTTTCGTT,CTCTCTGCTTGCGTT k=15 edist=0
#115 hits
#edist =2
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots28_11160599inds.fastq outm=Ots28_11160599.primer.mm.fastq literal=GTGCATATTTTACGTGGTTGAAGT k=24 edist=2
#483 hits
#search probes
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots28_11160599.primer.mm.fastq outm=Ots28_11160599.both.mm.fastq literal=CTCTCTGCTTTCGTT,CTCTCTGCTTGCGTT k=15 edist=2
#127 hits
#for this one, not explained by primer probe variation (with edit distance less than 2)
This mismatch tolerant alignment approach found that, indeed some markers are dropping out because of unaccounted for variants. For example Ots_128302-57 sequences well across most individuals, but 4 samples had zero reads at this marker despite otherwise genotyping very well. When we pulled the reads that matched the primer and probe sequence using a search algorithm that was tolerant of mismatches we found a single SNP in the probe region of these individuals (open code chunck below to see alignment )
probe x GTTGGTCGTATTGCAGG
||||||*|||||||||
probe y GTTGGTTGTATTGCAG
|||||||*||||||||
segment of read GTTGGTCATATTGCAGG
We could rescue these individuals by adding this base ambiguity to the probe sequences. The current probes are CCTGCAATACGACCAAC and CTGCAATACAACCAAC (note that the reverse complement is the actual match), we would need to change the in silico probe sequences to CCTGCAATA[TC]GACCAAC and CTGCAATA[TC]AACCAAC.
Let’s quickly test this
bash
export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/'
GTSEQ_GENO="/dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_Genotyper_v3.1.pl"
BAD_IND_fastq=raw_reads/Ots351/OtsAC20SILR_0001_CGGATG_AATGAG_S49_R1_001.fastq
echo "Ots_128302-57,C,T,CCTGCAATA[TC]GACCAAC,CTGCAATA[TC]AACCAAC,GGTTGCAGGGCAGAACTGT" | perl $GTSEQ_GENO - $BAD_IND_fastq
Yes, adding the base ambiguity code rescued this individual’s genotype: called genotype is an A1 homozygote
Ots_128302-57,C=38,T=0,380.000,GG,A1HOM,0,0,38,74.5,100.000,Pass,1
I wonder if this also causes heterozygotes to be miscalled as homozygotes. Let’s genotype all samples just at this marker and compare results.
#run this in folder for just this marker
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-195
#$ -tc 20
#$ -N GTseq-genotyperv3
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err
export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/'
FASTQS=(`ls -1 ../../raw_reads/Ots351/*fastq`)
INFILE=${FASTQS[$SGE_TASK_ID -1]}
OUTFILE=$(basename ${INFILE%.fastq}.genos)
GTSEQ_GENO="/dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_Genotyper_v3.1.pl
"
echo "Ots_128302-57,C,T,CCTGCAATA[TC]GACCAAC,CTGCAATA[TC]AACCAAC,GGTTGCAGGGCAGAACTGT"| perl $GTSEQ_GENO - $INFILE > $OUTFILE
# also submit this when done
SGE_Batch -q harold -c "perl /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_GenoCompile_v3.pl > Ots_128302_genos_0.1.csv" -r compiler
Ots_128302_gt <- read_csv("results/Ots_128302_genos_0.1.csv")
Ots_128302_gt %<>%
rename(`Ots_128302-57_ambiguity` = `Ots_128302-57`)
geno_compare <- right_join(select(Ots_128302_gt, Sample, `Ots_128302-57_ambiguity`), select(gt, Sample, `Ots_128302-57`), by = ("Sample" = "Sample"))
#geno_compare[geno_compare$`Ots_128302-57_ambiguity` != geno_compare$`Ots_128302-57`, ] %>%
geno_compare %<>%
mutate(change = case_when(`Ots_128302-57` == `Ots_128302-57_ambiguity` ~ "no_change",
`Ots_128302-57` == "TT" & `Ots_128302-57_ambiguity` == "CT" ~ "miscalled het",
`Ots_128302-57` == "CC" & `Ots_128302-57_ambiguity` == "CT" ~ "miscalled het",
`Ots_128302-57` == "00" & `Ots_128302-57_ambiguity` != "00" ~ "rescued",
TRUE ~ "other_change"
))
geno_compare %>%
count(change)
Of the 187 samples, 20 (11%) are heterozygotes miscalled as homozygotes because of this null allele and 11 are tossed as missing even though good data is available to make a call.
I think it’s worth having a discussion about how to deal with issues like this.
Here’s another example
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-195
#$ -tc 20
#$ -N GTseq-genotyperv3
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err
export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/'
FASTQS=(`ls -1 ../../raw_reads/Ots351/*fastq`)
INFILE=${FASTQS[$SGE_TASK_ID -1]}
OUTFILE=$(basename ${INFILE%.fastq}.genos)
GTSEQ_GENO="/dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_Genotyper_v3.1.pl
"
echo "Ots17_1488679_C6,G,T,TCCGGGCTCTGGCTGGG[GC][CA][CA][CA],TCCGGGCTCTTGCTGGG[GC][CA][CA][CA],CAGGTCTCTCCAGAGATGTTTGAT,0,0"| perl $GTSEQ_GENO - $INFILE > $OUTFILE
# also submit this when done
SGE_Batch -q harold -c "perl /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_GenoCompile_v3.pl > Ots17_1488679_C6.csv" -r compiler
null_fixed_Ots17 <- read_csv("results/Ots17_1488679_C6.csv")
Accomodating this null allele produced 556864 on target reads, compared to about 60k without.
fuzzy matching
Rather than take this one marker at a time approach, the gteq genotyper script has some built in tools to check for null alleles. it uses perl fuzzy matching to identify homozygotes where more than 80% of the total reads come from a fuzzy match to the probe, rather than an exact match. These are then flagged. the gtseq error report script then summarizes how many individuals are flagged for fuzzy matches/null alleles at each marker and provides the probes sequences as well as the most common fuzzy match.
null_report351 <- read_tsv("results/null_report351.txt")
Let’s focus only on common null alleles (>5%, present in at least 10 samples), and check for overlap between this approach and the ones used previously
null_report_markers <- null_report351 %>%
filter(Samples_Flagged > 9) %>%
pull(Locus)
null_report351[(null_report351$Locus %in% bad_counts$marker),] %>%
filter(Samples_Flagged > 9)
Yep.
We really don’t benefit from doing this on the Ots351 dataset alone, only run for Ots356
For the samples prepared with the additional RoSA markers we complete all of the analyses above a second time. This will provide a second, somewhat indepedent, set of data to confirm the results above, and will allow us to examine the impact of the additional markers on the panel and check for marker interactions.
#add additional markers to probe sequence (Ots37124-12267397 Ots37124-12272852 Ots37124-12277401 Ots37124-12281207 Ots37124-12310649)
ots356_probeseq <- read_csv("metadata/Ots356_probeseqs.csv", col_names = FALSE)
# write reverse complements for these
ots356_probeseq %<>%
filter(str_detect(X1, "Ots37124")) %>%
mutate(rc1 = insect::rc(X4)) %>%
mutate(rc2 = insect::rc(X5))
# added manually to the primer and probe list used for primer and probe counts because there are no base ambiguities to deal with.
Now we count primer hits in the hash file
#!/bin/bash
#$ -S /bin/bash
#$ -N primer_count
#$ -cwd
#$ -o $JOB_NAME.out
#$ -e $JOB_NAME.err
PRIMERS="GGTCTTGCAGTCAGGAGAGG
AACCCTATGGGAACTCGTAGAACT
TGCCGCTGGATTTATTGACA
GTATAAACTAGAGTCCAGTGTTATGTTAATGTCTT
GCATTACTAAAAACTGGTGTGTGGAA
GCCTACCAGAAAGTACCAATTGTGA
GCCCTGGAGAAGTACGTTTTAAACTAA
CTGCATGAACGTTAACTCAAATAAAAGGT
TGCAGGAAGAGTTCAGAGAAATCT
AGACAATCATGGTGTTTTGAGTCTTTCT
CAAAGTGCAGGTGCTGGC
TGCAGGCATCATGCTTAATAACT
TGCAGTGCTGAATTAGAGATTAATTTTTGTG
CATTCCATGACAATGATTGAAATCTAAAAACAC
CGTGGAGTAGGTGGTTACAGTTTAT
GGATGGTTGTCATTTCTCTGCAAA
GGTGATTTTGCCACAGAGTAGAGAT
CACTAAATATTCCTTATCATTTCATACTAAGTCTGAAGAA
CGCGAGTTAGCTCGAATATTATGATTTC
GCAGGGCGCAAAGTTCTT
CAGATGAAAAATAAATAATTGGGCCATTAGGAA
GCATGTAACACATTATTTGGCATATGTACT
CCTTTTTCTTATTAGTTTTACTTCCCCAGAGA
GAAAAAGTAAAGTAAAAGTAAAGTATTATACCACTAAAGACAAT
GCCATGGAGGACTGGATGA
CAAGGGATGTGACAAATTAATCAAACACATAA
GTTTGGCTATTGAAATTATACATTAAAACATGTAGCT
CCGTTGCAGGACTCATCAGT
TGTCATCTCTATTGCAATCTCAGTAGATTTCTAT
TGCCATCATAAACAACCTAACAAGTAACT
TGGAGAACTTGCACTGAATGTGAAA
GCCTACTGATAAATGTATGACAGTAATGGA
CAGGTCTGGTCTACATCGAACAC
TCAAAGACATCGAACACAAGAACGA
AGTTGTTCTTTTTATATTGTGTTTTTATTCCATTCCA
GTTGGCTCCTTCAATTCAATTTGGA
AGCTAGGCTGTAAATGCAAGGAT
TTGTGGAATTACACCTTCAGAGTTCAAT
CATTTCCACGAAAAGCCAGATGAC
TTTTCTAGGACAGGTTGCTTGCA
GTCAACAAATGCAGGTAACATAAATGGT
TGAAATAAATTGTTCTGTTGATATGTGAATTTTGGA
TTTTAAAAATGGAGATAAACTCCTGACCTGAA
TGAGCGAGATTTATCAAACTGTCAAAGA
TGCAGGGTTGGGGACAATT
CAGATGATAGCTTCAGTAAGTGGTTCA
GGTAGGCCGTCAGTGTAAAATAAGT
AGTACAAGTGCAGAGAATGACATCATG
CACTTTTGACTTTACATGGAACTTAACTCAT
TGGGACAGAGGTGGGAATTGA
GTGTGTGTGTGTGTGTGTCATCGT
GTCGTTTTTCATAGAAAATAGCTCACAGTT
TGCAGGTGGGACTTAAACACA
TGCAGGACCAACTTTCTCAT
TGCAGGAGAGCAGGGTAGA
TCTCCTGAACTAATTTAGACCTCTGAATGT
TTTGTGCGTAAAGTCAGGTAGTGT
GCCGAAAAATAAGCGATTAGTGATGA
TGCCACCTCAGTTTTAGTGTTATATCC
CATAGTATAGTGATTCGAGTCTGGAGTCT
CACAGGAAGGACGTGTTTTGATG
GCAGGGGCAGACTGAAGG
GATCATTTATCAAGACTATAGGCTATGGATACG
TGCATCCATTCATACCTGACCAATT
CACTTGTTCTGCACACTACTTGTC
GCCTAGGTATGTACGAAACTTCACA
GGTGAGCCATTCATAACAATCTT
GAGGATGGATGAGACTTTTCAGAT
AAACCACGGTATCCTTTATTCATC
GGCTTGCCTTTAGATAGAATCTTG
GGCTTTCTGATGATCTTGAACTTT
CCCTCCAAAAAGAAAACATTTGAT
AGATGATATGGATTTGCTGTGTGT
AACACAGATCAAATGTTTTCACAC
TGAGGTATTACTTGCTGAGTTTGC
GCGAGTGTTAAAAGGGTCAAA
AGCGAGGCTTGCGTTTTACT
AAGGCCGTGAACATCTGTG
GCGGTGGGATACCTCCTCTA
GCAATTACCCATGACTCTGTGA
CATTTAGCAGACACTCTTATCTTAGTGTCA
GTCACGATCAGCAATGAGACAG
ATAATGTTTCCTGACATGTTGGTG
GAAGAGTCATGTCAGGATTAGCC
CTAGTCCTGACATGTAGGCTCATC
ACTCCTGGACATTCTGTGGTG
AAAATAAACAGAATGGAGTACCATGA
CAGGTCTCTCCAGAGATGTTTGAT
TAGTCCCACTGTACATGCCTTAGA
CCCAAGTGGTGAGTGTCAGT
GACTCAGGTAAGGAAACATCAATGTCA
TCATAAACATGGTGTCTTTCAGTCAGTT
TCGCTAGGCAGAAATATAGGGTTCT
CCTGAACAGGTACACACAAACGA
CTACGCGAGAAATAACACTTTTCAAAACT
GTGGGTAATCGATGCCAAAGAGAT
TGTTTTTGGTCATGTATTTTCTCTGCTATTTTT
CCTGAGCATCCCAGTTGAACT
GCGTTACTGGTGTTATAAACGTTAGC
GTTCTTTTTAATGATGACTACAGGTCTTTCAC
CCTGGTCTGTTTGTGATCAAGATG
CTGAGCTTTTTCAACTTACTTGTTGGA
CTCGCCTCTGTCATTGTATTACCTT
TTGTTCAATGGGCATTAATGCATGTT
GCCCTTGTGACAATGCACTGTTATA
ATCCAAGGAGCCCCATTAAAGATTT
AGAGCATTCAATTTAAAAGCTGAAAACGA
CTGCTGGCGCAGACATG
CATGTTACCCAGCTAAAAGTCTATAGCA
GTTCGTGGGATTGTTCAATGTTCAT
GGCCACTGTCATAGAATTAGGCATT
GCAGGGACAGGGCCCT
GCTCTTGCCCATCTGTAGGAT
CACTGAACTGTAAGCCATTGTGATT
GGAAACCAGCTAGGATTCAGGAA
TGCTTCAGTGAAAATAAGCGTGAGA
AGTCAGTGTTGGTGTAGTGAAGAGA
CAATTACTCTTTCTCAGCCCTGTGT
ATGGGTTGGGATTATGGTTCATTGT
GGAGGTGTAGTGAAATGGGAAGAT
TGCAGGAACTTGCTATGCT
GGAGAACATGCATCACCATTCAAG
TGCAGTTACAAGCCTAAGACAATCT
CGTGTCCGGCTTCTTTTATTTCATT
CCGTCTGAGTAGGAGGATCAATACA
GGAGGCTCTACGTAGGCCT
GAGGTTTGTTACTGTCACCCATAGA
GGGCCACGGGGTTGTAAA
GGACAAGTTGAAACAGATCAGGAAGT
ACTCTCCCAGAAGGATTCAGAGA
GCTGACCACCGACCACAG
AATATTGGCTTTCTGAGAATGCATTTGG
TCTGAACTCACCAAAGGAACACTTG
TGTGTACATCCGCGTAAATATTGAAGATAA
GAGACAAAGGTTTGCAGGTTCATG
TGGTGAGAGCAGCTTTAAATGTCTT
ACTGCAGGCGTCATGCTT
TGATTGTCTCATGGCCAATTGTCA
TGACAGATTTCACCTTTAACTAGCTAAGC
GCAGGGAAAACTGGTCAGGA
TCTTTGATATTGAGCTCATAAAAGCAAGGT
GTCTCTCTCTCTTTGCATCATTACACT
TGCGCGTCTCATTCAACCAT
TGTTGTCTCGGACTGCATGAC
GGATGACTCCTACTAATAGACGGATGT
CTCTTGCTACTTGCAGTGTATCTCA
GGTTGCAGGGCAGAACTGT
CCTATTTTTGATAGGTCATAGTGAATGGGATAG
TGCAGGGACGGGGCT
GCTGTGATTGTGCTCTAAAGACATG
TGCAGGAGCTGTGATGGG
CAGGAACCTGCTTTAATGCTCT
CCTTCAAACTAACACATCATAGACATGCTT
TGCCGTGAGAAACTGGTCA
ACTTCTTGAGCCAATCGGATGATG
CTTTTCTGAATTAGTGCTGTGCTTGT
TGGGATAGAACAGGAGCTTAAACA
GTGACAGGAGACAGAAAGAGACATT
AATGGGTAACAAAGAAATAGCTAGCTACTT
ATGTCAATATATTTCACTATAATGATTGGAAGCCA
CCGACGCCTCACTGAGT
CGGTCATTGTAAATGTCAACGGTTT
GCTGAGGAAGGATTCTGTATTTGCT
ACTCTATCATCGGCAGGACCAT
TCAATGTTCATCAATGCACTTCCTGTA
GCCCTGCCTGCAACTTC
TGTTGTAATCTTTCTGAATATTTGCTTGCTT
CTCACTGCAAATCCAACTTCATCAT
GGTCTGTCTGTCTGTCTATCTGTCAATG
AGTGCAGGTCTCCAGATTTACA
GAGTGGTCAAGGTTTCAGTTTCTG
TTCCCTAATCTGACGTACTACCAACT
CTGGTCTGTGACGTCAAAATGATG
GTGATGAGAGGTTTCCGGAAAATCT
CCAGCAGAGACTGGGTTCAC
GCCTCACATTTTACTGATGTCACTTC
GAAACGTCTATGCTGTCCCCTTTAA
TTCTGGGTTGCCATACTCTTTCAAT
CGGAAGACCAGATTCTCCAAGAGTA
CAACGCGGGAATGGCTTTTAA
GAACTGAGCGGCTGCTG
CGAAATAAGGGCCTGGTGTTTAAAA
AAATGAGGCCGTCCTTTACACT
ACACATGGCTCGTCTGCA
GAGGCCTAATGTCTCTTGTGACT
GGGAGGCAGGCAAAAGGT
CGCTGGGCATGGATGAGT
TGCATGTTTTCTAACTGTGTTTTTGTGT
CTTTTCGGGTTATTCATGCTGTTGT
CTCATACTTTGTACCTGTGTGTTCCA
CCAAATACAGACCAGCTACTTGTGT
ACCAGTACCTAAACGTTAGAAAGCAA
CTCTTGTTTGCTATGGGAGATGTAGT
TAGGAGTTGGAAAGACTGCACA
GCCAAGTGATCAAGTGCTTGT
GAGGCTGACTTGGACTTTGC
CAGGCTTGTGTTAAGTAGGGAGAAA
CATAGATGTTTATATGAAAAACCTCCCACTGT
ATTTTCAAACAGGCATTTATCATTGGTGAA
CCCACTTCCAGAGCCTGAA
CAAGGGCACATTGGCAGATTTT
CGGACAAAGAGCTACAGAAATGC
ATGACCAATTGAAGAGTTCTTCCGT
GGACTCGTGCTTGAGGAAGATG
GTCCTCAGCTGGGTCAAGAG
CTGTTAGTGCAGAAGACGTAGCT
GGCCATCTTTCAGGACGTACAG
GGTGCCACTTTAGTATAGCTGCTTA
ACTGCCACAGACACGAACTC
CGATGTACTGAGGGCAGTGT
TGCACCTGCGAGAGCAT
TGTGTTTAGGATTGAACTGACCATGTT
ACACCCACTTCAACCTCCATAAC
TGTGTATTCGTCGACCGGA
GCCTACAGCAAATTCAGCTACACAT
GCCATATCCCGGGGCTTG
TCATCAAAACATGCCTCTTCTGTGT
CAGCCCGTCCCAAAATCAAG
GGCACAGCGACAGGAGTT
GCATGGCTGCCCTAGAACA
GCCAGATAGTAGCGTACATCATGAG
GGCACTCTCCCTGGCTAGA
GAGGATGACACTGTCCGTTTGT
GCAGGGTCTGTGTGGGTT
TGCAGGAGGAGGAAGGCA
AACAAAGAATGTTAAACACCAAACAGGAA
CTGCTTGTAGCCGTTCAGC
CTCAACAGTGCACCTCCCTTAATT
GCCATTTGACCAACGGAGC
TCTTTGGACTGTGTATACCAGGTGTA
CACCTTAGTTCCACGCAACATG
CCCCATATGAGACGCTACAGTAATG
ACAGAGCTGTGTCTACCAGA
GCCTGCCCTACTTATCTCTTATCA
CCCCAAAAACATCAAGAAGTCTAA
TGCAATATAGAACAAATCCGAAAA
TTTATATTCAGACATTCGCCAAAA
CCAATGGTGATTTTAGAACCATTAC
TGTGCGGAATTACTGATAATTGAC
CAAATGTAAGGATACGCTTGAATG
CTTATCTCAAAGGAATGGGAATGA
TCAAATTGAATGTAGACAGATGGAA
AACCATTGTTCTTGTATTCCTGCT
ACAACTAGTCATCGTGGAATCTGA
GAGAGAGTGCATTCTTCATCAAGTT
CTCTTTCAGTTGTCTTTGCTCTTG
TCTCAATGTGATTGAAATGGATGT
CATGAGACACCCTGGAGAAAA
GCACTGTATACAAAATCGTGTGGT
GCTAAATGTAAATCGAGTGGCTGT
GTTTTGCCAGAGAGAATGTACAAA
CCCACCATACAATAAAGGCATGT
AGCGCCTGTTTTACATAAACACTT
GTATGAGTTGTGTGGTTGCAATGT
GTAAAGAAACATGACCTTTTCTGAG
GTGCATATTTTACGTGGTTGAAGT
GCTGCTATTTCCGACCTTACAATA
CTTCCAGGAGGTATTGTTGGTTAT
AGAACCCATGCTTTCAGTACACTT
GTTCATTTTGAAATAACTGCATCG
AAATCACCCCATTTCTTTTGTG
TTATTTTGGGCTTCATATGGTTCT
AGATCAAGCTTGCTGACTTCG
CACAAATGTGACCGTTTTCATC
TGATATATTTTGCTGCAATGATCTG
AAGGAGCAGGAGATGTTATTGAAG
CTGACAAAAGTGATCTGCCTGA
GCTTAAACAGCTGCTATTAGGACA
GGGAGACTTAAAACAACCTCAAAA
TGATTTGACTTTTTGTGGTGTTTT
ATTAGTGCATATGAATCGGGCTAT
GAACTATCCTGACTCCCATTGAAA
ATAGAGCTTTTGGTGTTTCATTCC
TGCGAGATTTATCTACTTGTCCAG
CCTAAGAGGAGACGAGCATTACAG
AGAAAGCCATCATCATGAGACC
TCAAGACTGTGCTGTAGTTGTCTAC
CGCAAGTCAGCAGGGTGA
AAAATATGTGCAACATCCAATGTC
ATTATTCAAACAGAGATGGCGAAA
GGTTTGAGCCAATCAGTTGTGTT
GGCTCGAACCACCCAGTTTA
CAAACGCGCACTCACACACA
CCAGAGGTTAGATGGCCCTTT
CCTGAACAAATACTTAACGCTCCAGTT
GTGCTGCAGGAACCATGTG
AAGGTCTACTCCGGTTGTATTCGGT
CCATTCCCATCGGCATCGT
CAGGCAGTCACTGAGTCCG
CCTCTGCTGAGTTTGAGGGG
CAAGAACACCGAGATCTCCTTCA
TAACCATGACTTCTATCAATCACCCC
CCACTGGCTGTGGAGCTT
CAATGTCTAAAGTAATGGTGGTATTCTTGC
CCTTTGGGTCTGCTTGAGGTT
ATCGAGGATGCCTCAAAGACATC
GACTGTCTTGGAACCGTTGCTA
TGGGACCCACATAAAGCAACTG
CAGATGGTGCAGGCCGAA
CCAGCCCCGTAACACACAT
GCACCGTATCAACGAGCTCAT
CCTTAGCTGCTCTTTGAAGTTGACT
GGTGATAACAGGTGTTGCACCAA
GCGTACTGAGCCTGGATGACA
CCTCCAGATGAGACCCACTCT
CAGACAGGTCACCATCACACT
TGCCTAAACACTCCCAAGGT
GTCGATTACCGTTAGCTTCATCCT
GCAGGAAGCAAAGTTCGGTG
CAGTTCGCTTCTCCAGGGA
CTGCTCACCTGCATCAGTGT
ACTCTGGGTCCAGGAGGTTTT
TGAAAGATATCAATTGTAGTAGTGGTGGTG
TCGTGGATTGTGGCTTACGT
GAATGCAGGGCCAGGGAG
CAAATCAGAACAAAACCTCCCACAA
TCTCCCTCATTCCCATGTCATATCA
CTCCCCCCTGGACTTTGG
CGTGGTGTTCGCCTTCCT
TTTGAGTGAGTCACTGCACCAA
CCTGCTCTGTGTCTGGGC
CCGCCTTTCCCACCTTCTC
CTCTGCCATTCATTTGGGCTTTG
AGGTCCTCTGTCGCACCTA
TTTCTCATCCTTCTCTCTTCCAGTCT
CGCAATGAGCCAACCCCT
GGAACTTCCTCTCCCGTTCTG
CCATACAGCCAGTCCAGGTG
CCAAATCCTCATCCCACACACT
CTCCCTGTTCGCTAGCCG
ACCACCCACCTCCTCAGA
GGGAGAGGGAGACGTGGA
ATCAGGTCTGGGGCGACA
ACTTTGAGGACTTACTCCTGTCCT
GTTTTGGTGTGGTCTCAAATCC
ACCTTTTAGCCAGTGACAACATTT
TTTTGGAACCCTTTTTACTACGAG
ATTTGCTGTGTGTGGAGTGAAT
CAGCAGCTGTTTATGACTGACTTC
TTAGCAGGCGATCTAATTCTGATT
ATCATCTCTGCTCAGAGGCTATTC
CAATTGTAGCCCTCTAACTTTTCC
TATACCTTTGTAGCATCCCTCTCC
TACGGTAGGAAGACTGAATGAGTG
AGTGCTCCATGCTGGAGTTT
CAGTTTAAGTGTTACCACCACGAG
TTAAATCACCCAGAGCTTGTTAGA
CATTTCAAAATTAGGAGGTTAGGG
ATGGTTAAATTGACTCCTCCCTAT
CACATGGCTCTTTGCTCAAAAT
CAGTTCCTGACATTCACCAAAATA
GTGCAGGAATATAGACTACACA
CCAGGTCAGACAGTGCTAGA
CGGTTCGTGTCCATTGCATT
CTTGAATCACATGGCTGCTG
CCACATCTTCTGGTCACTCTC"
for i in $PRIMERS
do
cat merged_Ots356_hash.txt | /dfs/Omalley_Lab/dayan/software/seqkit grep -s -i -p $i | grep ">" | awk -F ";" '{sum += $3} END {print sum}' >> Ots356_primer_cts.txt
done
#save this file as a script and submit with qsub
Now we count probe sequences
#!/bin/bash
#$ -S /bin/bash
#$ -N probe_count
#$ -cwd
#$ -o $JOB_NAME.out
#$ -e $JOB_NAME.err
PROBES="TCAGCGAAGTGGAGAT,ATCTCCACTTCGCTGA
ATTAGAACTCGTAGAACTAT,ATATTAGAACTCGTATAACTAT,ATAGTTCTACGAGTTCTAAT,ATAGTTATACGAGTTCTAATAT
TAGCTCCGAGCTAA,TAGCTCTGAGCTAA,TTAGCTCGGAGCTA,TTAGCTCAGAGCTA
CACCAATCAATTAATTATT,ACCAATCAATTCATTATT,AATAATTAATTGATTGGTG,AATAATGAATTGATTGGT
TGACCTGAAAATATATATTTTT,TGACCTGAAAATACATATTTTT,ACCTGAAAATATATTTTTTT,ACCTGAAAATACATTTTTTT,AAAAATATATATTTTCAGGTCA,AAAAATATGTATTTTCAGGTCA,AAAAAAATATATTTTCAGGT,AAAAAAATGTATTTTCAGGT
CCGACACAATTTTGT,CCAACACAATTTTGT,CCGACATAATTTTGT,CCAACATAATTTTGT,ACAAAATTGTGTCGG,ACAAAATTGTGTTGG,ACAAAATTATGTCGG,ACAAAATTATGTTGG
CTAGGTGAAACTTTTTTTAAA,CTAGGTGAAACTTTTTAAAAA,TTTAAAAAAAGTTTCACCTAG,TTTTTAAAAAGTTTCACCTAG
TGTGAGGGCGGTCTT,ATGTGAGGTCGGTCTT,AAGACCGCCCTCACA,AAGACCGACCTCACAT
TTTAAGATGTAGTT,TTTAAGGTGTAGTT,AACTACATCTTAAA,AACTACACCTTAAA
CCGCTTGAAAGTTTGA,CGCTTGAAGGTTTGA,TCAAACTTTCAAGCGG,TCAAACCTTCAAGCG
CTGCCACCCTTTGA,CTGCCATCCTTTGA,TCAAAGGGTGGCAG,TCAAAGGATGGCAG
ATAAAGTGTGTTAT,ATAAAGCGTGTTAT,ATAACACACTTTAT,ATAACACGCTTTAT
CAAAGAAAATCAAAATTT,CAAAGAAAATCTAAATTT,AAATTTTGATTTTCTTTG,AAATTTAGATTTTCTTTG
CTGTATACAGTAAGAGTATTAAT,ACAGTAAGAGCATTAAT,ATTAATACTCTTACTGTATACAG,ATTAATGCTCTTACTGT
CAAAGTCAAAGATCCTATTAAA,AAGTCAAAGATCGTATTAAA,TTTAATAGGATCTTTGACTTTG,TTTAATACGATCTTTGACTT
AGAAAGTTCTAGAAATAATT,AAAGTTCTAGGAATAATT,AATTATTTCTAGAACTTTCT,AATTATTCCTAGAACTTT
CCATCCTGTCTTGTCTG,CATCCTGTCATGTCTG,CAGACAAGACAGGATGG,CAGACATGACAGGATG
CCACTACTTAACGTGCTTT,CCACTACTTAACATGCTTT,AAAGCACGTTAAGTAGTGG,AAAGCATGTTAAGTAGTGG
ACCATTTGATATAACTGCGTTAG,CATTTGATATAACGGCGTTAG,CTAACGCAGTTATATCAAATGGT,CTAACGCCGTTATATCAAATG
TTATGGCTATTATT,TTATGGTTATTATT,AATAATAGCCATAA,AATAATAACCATAA
CAATGAATACAATATCTAACCTAAT,AATGAATACAATATCTAATCTAAT,ATTAGGTTAGATATTGTATTCATTG,ATTAGATTAGATATTGTATTCATT
ATGCATTCACCTGTATTAT,TGCATTCACCAGTATTAT,ATAATACAGGTGAATGCAT,ATAATACTGGTGAATGCA
AAACAAACAACGCCTCATG,AACAAACAACACCTCATG,CATGAGGCGTTGTTTGTTT,CATGAGGTGTTGTTTGTT
TTAGTCAACTGTTGTTTTT,TTAGTCAACTGTTATTTTT,AAAAACAACAGTTGACTAA,AAAAATAACAGTTGACTAA
TGTTGGCGAAGTGGGT,TGTTGGCGAAGTGTGT,TGTTGGTGAAGTGGGT,TGTTGGTGAAGTGTGT,ACCCACTTCGCCAACA,ACACACTTCGCCAACA,ACCCACTTCACCAACA,ACACACTTCACCAACA
AGAGCATGTAGTTTTG,AGCATGTAATTTTG,CAAAACTACATGCTCT,CAAAATTACATGCT
TTTTAAAA,CTGGCATCCA,TTTTTAAACTGGCATCCA,TTTTAAAA,TGGATGCCAG,TGGATGCCAGTTTAAAAA
AGCTAGCGCTCCTC,AGCTAGTGCTCCTC,GAGGAGCGCTAGCT,GAGGAGCACTAGCT
AATGTCATATAGAAATCTAC,AATGTCATAGAAATCTACTG,GTAGATTTCTATATGACATT,CAGTAGATTTCTATGACATT
TACAGGAGATAAGGTCGCA,CAGGAGATAGGGTCGCA,TGCGACCTTATCTCCTGTA,TGCGACCCTATCTCCTG
TCTCTTATCTGAGTTCTGC,CTCTTATCTGTGTTCTGC,GCAGAACTCAGATAAGAGA,GCAGAACACAGATAAGAG
CACATAGTGTAGCTTTACTAC,CACATAGTGTAGCTCTACTAC,GTAGTAAAGCTACACTATGTG,GTAGTAGAGCTACACTATGTG
AAAACAAATCATTTTTCG,AAAAACAAATAATTTTTCG,CGAAAAATGATTTGTTTT,CGAAAAATTATTTGTTTTT
ATGTACTTTAACGATTCATTT,ATGTACTTTAACGTTTCATTT,AAATGAATCGTTAAAGTACAT,AAATGAAACGTTAAAGTACAT
CCACCGCCATCTGATA,CACCGCCGTCTGATA,TATCAGATGGCGGTGG,TATCAGACGGCGGTG
ACGGAACAAATAAGACATTT,CGGAACAAATAAGCCATTT,AAATGTCTTATTTGTTCCGT,AAATGGCTTATTTGTTCCG
TAACACATGTTGGAGGTC,AACACATGTTAGAGGTC,GACCTCCAACATGTGTTA,GACCTCTAACATGTGTT
TCCATGGAAACGGACAAT,TCCATGGTAACGGACAAT,TCCATGGAAACTGACAAT,TCCATGGTAACTGACAAT,ATTGTCCGTTTCCATGGA,ATTGTCCGTTACCATGGA,ATTGTCAGTTTCCATGGA,ATTGTCAGTTACCATGGA
CTCTACAGTATG,CTCTACAATATG,CATACTGTAGAG,CATATTGTAGAG
TGCCACATGATAATTGA,CCACATGGTAATTGA,TCAATTATCATGTGGCA,TCAATTACCATGTGG
ATCAGTGACATAAGTTGTCCA,TCAGTGACATAAATTGTCCA,TGGACAACTTATGTCACTGAT,TGGACAATTTATGTCACTGA
ATTGCCCATCTCAGAATA,AATTGCCCATCTTAGAATA,TATTCTGAGATGGGCAAT,TATTCTAAGATGGGCAATT
CAGTGTATTAGTCATTCTTA,CAGTGTATTAGTCGTTCTTA,TAAGAATGACTAATACACTG,TAAGAACGACTAATACACTG
CCCCGAAGTACTTTT,CCCGAAGAACTTTT,AAAAGTACTTCGGGG,AAAAGTTCTTCGGG
CCATTTTAATTCCA,CCATTTGAATTCCA,TGGAATTAAAATGG,TGGAATTCAAATGG
TGTTCCAATGTAAAATGTATGC,TTCCAATGTAAAATATATGC,GCATACATTTTACATTGGAACA,GCATATATTTTACATTGGAA
CTGCCTAGTTAAATAAAATA,CTGCCTAGTTAAATTAAATA,TATTTTATTTAACTAGGCAG,TATTTAATTTAACTAGGCAG
CCGAGCTTGAGTTAGGA,CCGAGCTTGACTTAGGA,TCCTAACTCAAGCTCGG,TCCTAAGTCAAGCTCGG
AGCTAGTGCTTAGCAGCTAA,AGCTAGTGCTTAGCAGCTAC,AGCTAGTGCATAGCAGCTAA,AGCTAGTGCATAGCAGCTAC,TTAGCTGCTAAGCACTAGCT,GTAGCTGCTAAGCACTAGCT,TTAGCTGCTATGCACTAGCT,GTAGCTGCTATGCACTAGCT
AGGGACAGTTTCGCAGACG,AAGGGACAGTTTCTCAGACG,CGTCTGCGAAACTGTCCCT,CGTCTGAGAAACTGTCCCTT
TCATCTTTTGTTATTTCCTTG,ATCTTTTGTTCTTTCCTTG,CAAGGAAATAACAAAAGATGA,CAAGGAAAGAACAAAAGAT
ATTTGACTTGTCTTTTT,ATTTGACTTGTGTTTTT,AAAAAGACAAGTCAAAT,AAAAACACAAGTCAAAT
CACACACATGCACG,CACACATATGCACG,CGTGCATGTGTGTG,CGTGCATATGTGTG
AACCTGTGTGATTT,AACCTGCGTGATTT,AAATCACACAGGTT,AAATCACGCAGGTT
ATGATAAT,ATGATATT,ATTATCAT,AATATCAT
CTGAATGTTTTTTTTAATCTTT,CTGAATGTTTTTTTTTATCTTT,AAAGATTAAAAAAAACATTCAG,AAAGATAAAAAAAAACATTCAG
CTGTGTGTCTAAGACAAT,CTGTGTGTCTATGACAAT,ATTGTCTTAGACACACAG,ATTGTCATAGACACACAG
AATTGCCTCATTGGGTG,AATTGCCTCATTAGGTG,CACCCAATGAGGCAATT,CACCTAATGAGGCAATT
AAACCATTTTCATTCTTTTG,CCATTTTCACTCTTTTG,CAAAAGAATGAAAATGGTTT,CAAAAGAGTGAAAATGG
GAGACTGTTGAGAC,GAGACTATTGAGAC,GTCTCAACAGTCTC,GTCTCAATAGTCTC
ATAACATCTGCAGCATTAA,ATAACATGTGCAGCATTAA,TTAATGCTGCAGATGTTAT,TTAATGCTGCACATGTTAT
GTTTGGCATAAAGT,GTTTGGAATAAAGT,ACTTTATGCCAAAC,ACTTTATTCCAAAC
CCTTAAGCATATTTCT,CCTTAAGCGTATTTCT,AGAAATATGCTTAAGG,AGAAATACGCTTAAGG
TTTCCAATGGTATAGATATGA,TTTCCAATGATATAGATATGA,TCATATCTATACCATTGGAAA,TCATATCTATATCATTGGAAA
TGGGAAGCAATCAA,AATTGGGAAGCAGTCAA,TTGATTGCTTCCCA,TTGACTGCTTCCCAATT
CATAGACAGGGGCCAT,CATAGACGGGGGCCAT,CATACACAGGGGCCAT,CATACACGGGGGCCAT,ATGGCCCCTGTCTATG,ATGGCCCCCGTCTATG,ATGGCCCCTGTGTATG,ATGGCCCCCGTGTATG
ACATTACTTTTCAAAAATATT,ACATTACTTTACAAAAATATT,AATATTTTTGAAAAGTAATGT,AATATTTTTGTAAAGTAATGT
ATAAAAAATTCTGCGTGAATG,ATAAAAAATTATGCGTGAATG,CATTCACGCAGAATTTTTTAT,CATTCACGCATAATTTTTTAT
AATATATTTTTTATAGGC,AATATATATTTTATAGGC,GCCTATAAAAAATATATT,GCCTATAAAATATATATT
AAAGCTGATTAAAAA,AAAGCTGATTTAAAA,TTTTTAATCAGCTTT,TTTTAAATCAGCTTT
TAAAAATGGTTGATATGTA,TAAAAATGATTGATATGTA,TACATATCAACCATTTTTA,TACATATCAATCATTTTTA
GACACACTCACGA,GACACACTCATGA,TCGTGAGTGTGTC,TCATGAGTGTGTC
TTCTCAAGTCCTACTCAACTG,TTCTCAAGTCGTACTCAACTG,CAGTTGAGTAGGACTTGAGAA,CAGTTGAGTACGACTTGAGAA
GTGATAGTTTGATAGTTTTAT,GTGATAGTTTCATAGTTTTAT,ATAAAACTATCAAACTATCAC,ATAAAACTATGAAACTATCAC
CTGAATCCTGTAAG,CTGCATCCTGTAAG,CTTACAGGATTCAG,CTTACAGGATGCAG
GGAGATAGTCAGGG,GGAAATAGTCAGGG,CCCTGACTATCTCC,CCCTGACTATTTCC
CCGCAACAGATC,CTGCAACAGATC,GATCTGTTGCGG,GATCTGTTGCAG
CCTCACATACTCCCTT,CCTCATATACTCCCTT,AAGGGAGTATGTGAGG,AAGGGAGTATATGAGG
GGCGGCTCGGAAAATTATTTT,GGCGGCTCGGGAAATTATTTT,AAAATAATTTTCCGAGCCGCC,AAAATAATTTCCCGAGCCGCC
ATTGTGCTTATCACA,ATTGTGCTTAGCACA,TGTGATAAGCACAAT,TGTGCTAAGCACAAT
AATAGGCCGACATCAA,AAATAGGCCAACATCAA,TTGATGTCGGCCTATT,TTGATGTTGGCCTATTT
CTGTAGTGACGCCGCAACAC,CTGTAGTGACACCGCAACAC,GTGTTGCGGCGTCACTACAG,GTGTTGCGGTGTCACTACAG
GCATAGTTTGG,GCATGGTTTGG,CCAAACTATGC,CCAAACCATGC
GTAAAGCAAA,GTAACGCAAA,TTTGCTTTAC,TTTGCGTTAC
GCCACACTGTT,GCCATACTGTT,AACAGTGTGGC,AACAGTATGGC
TGAGACACAGTAAACCTTCTT,TGAGACACAGCAAACCTTCTT,AAGAAGGTTTACTGTGTCTCA,AAGAAGGTTTGCTGTGTCTCA
TCAGCTATGTAG,TTAGCTATGTAG,CTACATAGCTGA,CTACATAGCTAA
TCCGGGCTCTGGCTGGGGACA,TCCGGGCTCTTGCTGGGGACA,TGTCCCCAGCCAGAGCCCGGA,TGTCCCCAGCAAGAGCCCGGA
CCCACACATGTGGTGACCTCA,CCCACACATGCGGTGACCTCA,TGAGGTCACCACATGTGTGGG,TGAGGTCACCGCATGTGTGGG
CCCTATTCTCCAATCCATAT,CCCTGTTCTCCAATCCATAT,CCCTATTCTCTAATCCATATG,CCCTGTTCTCTAATCCATATG,ATATGGATTGGAGAATAGGG,ATATGGATTGGAGAACAGGG,CATATGGATTAGAGAATAGGG,CATATGGATTAGAGAACAGGG
AACCAGTAGAATAACC,CAGTGGAATAACC,GGTTATTCTACTGGTT,GGTTATTCCACTG
CAATCTATCATCGACCAGC,CAATCTATCATCAACCAGC,GCTGGTCGATGATAGATTG,GCTGGTTGATGATAGATTG
TGCTAAATGGCATATATTAT,CTAAATGGCACATATTAT,ATAATATATGCCATTTAGCA,ATAATATGTGCCATTTAG
CACTCTTTATATCCACACCG,CACTCTTTATATCCACACCA,CAGTCTTTATATCCACACCG,CAGTCTTTATATCCACACCA,CGGTGTGGATATAAAGAGTG,TGGTGTGGATATAAAGAGTG,CGGTGTGGATATAAAGACTG,TGGTGTGGATATAAAGACTG
TCTTGCAATCATTTTTAAC,CTTGCAATCATATTTAAC,GTTAAAAATGATTGCAAGA,GTTAAATATGATTGCAAG
AAGCGACTTGATTATC,AGCGACATGATTATC,GATAATCAAGTCGCTT,GATAATCATGTCGCT
ATGTCTGAAATGAAAGCC,AATGTCTGAAATTAAAGCC,GGCTTTCATTTCAGACAT,GGCTTTAATTTCAGACATT
TCCTGAAAAACGACATCC,CTGAAAAACAACATCC,GGATGTCGTTTTTCAGGA,GGATGTTGTTTTTCAG
CTTTCGTCCTTAGCACATAG,CTTTCGTCCTTAACACATAG,CTATGTGCTAAGGACGAAAG,CTATGTGTTAAGGACGAAAG
TATCTGGGCGGGCTG,CTATCTGGACGGGCTG,CAGCCCGCCCAGATA,CAGCCCGTCCAGATAG
ATGTATTGTTCATTTAATG,TGTATTGTTCGTTTAATG,CATTAAATGAACAATACAT,CATTAAACGAACAATACA
TAGCGCAAACCCCGAACC,CGCAAACACCGAACC,GGTTCGGGGTTTGCGCTA,GGTTCGGTGTTTGCG
AAAATGTACCACATACTTGT,AAATGTACCACATACTCGT,ACAAGTATGTGGTACATTTT,ACGAGTATGTGGTACATTT
AAGAGTCCAGCGTTACTT,AAGAGTCCAGTGTTACTT,AAGTAACGCTGGACTCTT,AAGTAACACTGGACTCTT
AAGTAACGTATCAAATGGC,AAAGTAACGTATCATATGGC,GCCATTTGATACGTTACTT,GCCATATGATACGTTACTTT
CTGTGTTGAATTTAACATAAT,TCTGTGTTGAATTTAACGTAAT,ATTATGTTAAATTCAACACAG,ATTACGTTAAATTCAACACAGA
TGCATTGCTAAGACTTG,TGCTTTGCTAAGACTTG,TGCATTGTTAAGACTTG,TGCTTTGTTAAGACTTG,CAAGTCTTAGCAATGCA,CAAGTCTTAGCAAAGCA,CAAGTCTTAACAATGCA,CAAGTCTTAACAAAGCA
CCGGTTTACCGATTTG,CGGTTTACCAATTTG,CAAATCGGTAAACCGG,CAAATTGGTAAACCG
CTAAAATGTCATGTAAATAT,ACTAAAATGTCATATAAATAT,ATATTTACATGACATTTTAG,ATATTTATATGACATTTTAGT
CCTTGGATGGGA,CCTTGGATAGGA,TCCCATCCAAGG,TCCTATCCAAGG
ATGCTATTAAATGAATATTC,ATGCTATTAAATGACTATTC,GAATATTCATTTAATAGCAT,GAATAGTCATTTAATAGCAT
TAAAAAAATATAAA,TAAAAATATATAAA,TTTATATTTTTTTA,TTTATATATTTTTA
TGTAGCTAATTTTAAGTTCTC,AGCTAATTTTAAATTCTC,GAGAACTTAAAATTAGCTACA,GAGAATTTAAAATTAGCT
CACTACGGTAAGACCAT,CACTACAGTAAGACCAT,ATGGTCTTACCGTAGTG,ATGGTCTTACTGTAGTG
ATATGGTATGTAGAGGCTAGTTA,TATGTAGAGGCAAGTTA,TAACTAGCCTCTACATACCATAT,TAACTTGCCTCTACATA
ACTGTATATGTTACGTTTTC,ACTGTATATGTTAAGTTTTC,GAAAACGTAACATATACAGT,GAAAACTTAACATATACAGT
AGTTACAAGTGGTGTTTCA,ACAAGTGGCGTTTCA,TGAAACACCACTTGTAACT,TGAAACGCCACTTGT
TTTAACAAGAAAATTATACATTTC,CAAGAAAGTTATACATTTC,GAAATGTATAATTTTCTTGTTAAA,GAAATGTATAACTTTCTTG
AGACATGTAGCTATGTAGGTAA,AGACATGTAGCTATCTAGGTAA,TTACCTACATAGCTACATGTCT,TTACCTAGATAGCTACATGTCT
ATGCATAAAAGGTAATTGTG,ATGCATAAAAGGTCATTGTG,CACAATTACCTTTTATGCAT,CACAATGACCTTTTATGCAT
CATTCAAAAAGTAT,CATTCAGAAAGTAT,ATACTTTTTGAATG,ATACTTTCTGAATG
ATGAGAGAGTCTTTCTCTGTT,ATGAGAGAGTCTTTTTCTGTT,AACAGAGAAAGACTCTCTCAT,AACAGAAAAAGACTCTCTCAT
ATTCTTCCTCTACAATTG,ATTCTTCCTCCACAATTG,ATACTTCCTCTACAATTG,ATACTTCCTCCACAATTG,CAATTGTAGAGGAAGAAT,CAATTGTGGAGGAAGAAT,CAATTGTAGAGGAAGTAT,CAATTGTGGAGGAAGTAT
TTGTGCATTTTCCCC,TGTGCATTTCCCCC,GGGGAAAATGCACAA,GGGGGAAATGCACA
CAAACCAGCAAACAT,ACAAACCAGAAAACAT,ATGTTTGCTGGTTTG,ATGTTTTCTGGTTTGT
AATTTGAATGACCA,AATTTGTATGACCA,TGGTCATTCAAATT,TGGTCATACAAATT
CTACTTATGTAGCATTTTAA,CTACTTATGTAGGATTTTAA,TTAAAATGCTACATAAGTAG,TTAAAATCCTACATAAGTAG
CTCTGCCCCTGGAC,CTCTACCCCTGGAC,CTCTGCTCCTGGAC,CTCTACTCCTGGAC,GTCCAGGGGCAGAG,GTCCAGGGGTAGAG,GTCCAGGAGCAGAG,GTCCAGGAGTAGAG
AATCCCTCCTTTTTCC,TCCCTCATTTTTCC,GGAAAAAGGAGGGATT,GGAAAAATGAGGGA
AGAGAGGGGTCAAA,AGAGAGTGGTCAAA,TTTGACCCCTCTCT,TTTGACCACTCTCT
AGACTGGTAAAAGA,AGACTGGTAAAAGT,AGACTGATAAAAGA,AGACTGATAAAAGT,TCTTTTACCAGTCT,ACTTTTACCAGTCT,TCTTTTATCAGTCT,ACTTTTATCAGTCT
TCAAAGATATGATTCAATTAA,AAGATATGGTTCAATTAA,TTAATTGAATCATATCTTTGA,TTAATTGAACCATATCTT
CAACTGAAGAAAATAATATG,CTGAAGAAAAGAATATG,CATATTATTTTCTTCAGTTG,CATATTCTTTTCTTCAG
CAGGTTAGGAATGGTTG,CAGGTTAGGATTGGTTG,CAACCATTCCTAACCTG,CAACCAATCCTAACCTG
CCTGTTATCAGACCCAAAT,CTGTTATCAGCCCCAAAT,ATTTGGGTCTGATAACAGG,ATTTGGGGCTGATAACAG
TCCCAAAGTCGAGTGTG,CCCAAAGTCAAGTGTG,CACACTCGACTTTGGGA,CACACTTGACTTTGGG
GTATATTTAGAATG,GTATATGTAGAATG,CATTCTAAATATAC,CATTCTACATATAC
TGTTCGAGAATGAAGATGAGTAA,TCGAGAATGAAGGTGAGTAA,TTACTCATCTTCATTCTCGAACA,TTACTCACCTTCATTCTCGA
AAGGCTTTGGTTGTTTG,AAGGCTTTGATTGTTTG,CAAACAACCAAAGCCTT,CAAACAATCAAAGCCTT
GAATGGTGTTAAAT,GAATGGCGTTAAAT,ATTTAACACCATTC,ATTTAACGCCATTC
CATTTACCAGTTCTCACACAC,TTTACCAGTTCACACACAC,GTGTGTGAGAACTGGTAAATG,GTGTGTGTGAACTGGTAAA
TGGTTCCCCAAATTT,TGATGGTTCCCCTAATTT,AAATTTGGGGAACCA,AAATTAGGGGAACCATCA
AGAATGAAGTGAAAAGAA,AGAATGAAGTAAAAAGAA,TTCTTTTCACTTCATTCT,TTCTTTTTACTTCATTCT
CGTATGTGCAATGCATG,CGTATGTGCATTGCATG,CATGCATTGCACATACG,CATGCAATGCACATACG
TTGGTAAACCTGTTTATTGGTA,TGGTAAACCTGTTTTTTGGTA,TACCAATAAACAGGTTTACCAA,TACCAAAAAACAGGTTTACCA
TTCTGTTACTGGAC,CTGTTACTGGGC,GTCCAGTAACAGAA,GCCCAGTAACAG
CCTGCAATACGACCAAC,CTGCAATACAACCAAC,GTTGGTCGTATTGCAGG,GTTGGTTGTATTGCAG
CTATCAAAGCAATACATTG,CTATCAAAGCAGTACATTG,CAATGTATTGCTTTGATAG,CAATGTACTGCTTTGATAG
ACAAATTAATTAAA,ACAAATGAATTAAA,TTTAATTAATTTGT,TTTAATTCATTTGT
CAGTTCTGTAATGCATT,CAGTTTTGTAATGCATT,AATGCATTACAGAACTG,AATGCATTACAAAACTG
AACTGTTCAAACCC,AACTGTCCAAACCC,GGGTTTGAACAGTT,GGGTTTGGACAGTT
TTTCTACTTAGTAA,TTTCTATTTAGTAA,TTACTAAGTAGAAA,TTACTAAATAGAAA
TCTGGCGGATTTACA,CTGGCGGGTTTACA,TGTAAATCCGCCAGA,TGTAAACCCGCCAG
AAGATGGTATGTAT,AAGATGTTATGTAT,ATACATACCATCTT,ATACATAACATCTT
CTTAGACGTCAGAGGTC,CTTAGACGTCCGAGGTC,GACCTCTGACGTCTAAG,GACCTCGGACGTCTAAG
TGTTACGGGACATACT,TCTGTTACGGACATACT,AGTATGTCCCGTAACA,AGTATGTCCGTAACAGA
AGAAGCCCAGCTCC,AGAAGCTCAGCTCC,GGAGCTGGGCTTCT,GGAGCTGAGCTTCT
CAGCACATAACTTGACCTC,AGCACATAACCTGACCTC,GAGGTCAAGTTATGTGCTG,GAGGTCAGGTTATGTGCT
ACTGGGAAGATTGTTTG,CTGGGAAGACTGTTTG,CAAACAATCTTCCCAGT,CAAACAGTCTTCCCAG
CTATAAAGTTGGACAGTTGG,AAAGTTGGGCAGTTGG,CCAACTGTCCAACTTTATAG,CCAACTGCCCAACTTT
TGAGTCCCTGACCAGC,AGTCCCCGACCAGC,GCTGGTCAGGGACTCA,GCTGGTCGGGGACT
CCTGTCTCATTCCC,CTGTCCCATTCCC,GGGAATGAGACAGG,GGGAATGGGACAG
CTTTCCCCGTGTTGGT,ACTTTCCCTGTGTTGGT,ACCAACACGGGGAAAG,ACCAACACAGGGAAAGT
CATTTTTCAGAATTGTATTC,CATTTTTCAGAATTCTATTC,GAATACAATTCTGAAAAATG,GAATAGAATTCTGAAAAATG
CTGGTACCCA,CTGATACCCA,TGGGTACCAG,TGGGTATCAG
CAGTGTCATTTTCGGC,ATCAGTGTCATCTTCGGC,GCCGAAAATGACACTG,GCCGAAGATGACACTGAT
CAGGCGGTTCTCC,CAGGCAGTTCTCC,GGAGAACCGCCTG,GGAGAACTGCCTG
ACCCACCAGTGTCATT,CCACCAGCGTCATT,AATGACACTGGTGGGT,AATGACGCTGGTGG
CCAGTCATGGGTCATT,TCCAGTCATTGGTCATT,AATGACCCATGACTGG,AATGACCAATGACTGGA
CAATCGGAAGTCGG,CAATCGTAAGTCGG,CCGACTTCCGATTG,CCGACTTACGATTG
ACGCTCGGAACATT,ACGCTCTGAACATT,AATGTTCCGAGCGT,AATGTTCAGAGCGT
ACGAGACAGATATTC,ACGAGACTGATATTC,GAATATCTGTCTCGT,GAATATCAGTCTCGT
CTGCAACTCGACGCAAG,ACTGCAACTCTACGCAAG,CTTGCGTCGAGTTGCAG,CTTGCGTAGAGTTGCAGT
ATGGGAGACAGATAACT,ATGGGAGACATATAACT,AGTTATCTGTCTCCCAT,AGTTATATGTCTCCCAT
TGGGGCAACGCACAATTGGCT,TGGGGCGACGCACAATTGGCT,AGCCAATTGTGCGTTGCCCCA,AGCCAATTGTGCGTCGCCCCA
AGCCTAGCTCTCGGAAG,AGCCTAGTTCTCGGAAG,CTTCCGAGAGCTAGGCT,CTTCCGAGAACTAGGCT
CACCACTAGAACTCTC,CACCACTAAAACTCTC,GAGAGTTCTAGTGGTG,GAGAGTTTTAGTGGTG
ATTCTGACAGCTGTTTTG,CTGACAGCCGTTTTG,CAAAACAGCTGTCAGAAT,CAAAACGGCTGTCAG
ATAGAACTACAATTCACATATAT,AACTACAATTCGCATATAT,ATATATGTGAATTGTAGTTCTAT,ATATATGCGAATTGTAGTT
CTACCGTACTGAACTC,CCGTACGGAACTC,GAGTTCAGTACGGTAG,GAGTTCCGTACGG
CCCGGACAAGATGAGACAG,CCCGGACAAGATGAGACCG,CTGTCTCATCTTGTCCGGG,CGGTCTCATCTTGTCCGGG
TGTCCTGTCCTCAGATCA,TCCTGTCCCCAGATCA,TGATCTGAGGACAGGACA,TGATCTGGGGACAGGA
CTGAGATCACTTTGAGCAC,ACTGAGATCACTGAGCAC,GTGCTCAAAGTGATCTCAG,GTGCTCAGTGATCTCAGT
GAACTTAAAACACT,GAACTTGAAACACT,AGTGTTTTAAGTTC,AGTGTTTCAAGTTC
ATTACCAACGGAGAACC,TTACCAACAGAGAACC,GGTTCTCCGTTGGTAAT,GGTTCTCTGTTGGTAA
GGGATGGAGTATTT,GGGAAGGAGTATTT,GGGATGAAGTATTT,GGGAAGAAGTATTT,AAATACTCCATCCC,AAATACTCCTTCCC,AAATACTTCATCCC,AAATACTTCTTCCC
CTACTGTTGTATTTTCTC,CTGTTGTGTTTTCTC,GAGAAAATACAACAGTAG,GAGAAAACACAACAG
AGATCATGGGAATCATAT,ATCATGGGCATCATAT,ATATGATTCCCATGATCT,ATATGATGCCCATGAT
AGACCACAAGATACAGTACC,AGACCACAAGATAGTACC,GGTACTGTATCTTGTGGTCT,GGTACTATCTTGTGGTCT
CCACCATCAAGCACTG,CCACCATCATGCACTG,CAGTGCTTGATGGTGG,CAGTGCATGATGGTGG
ATTCAAAGTCAAATTTT,ATTCAAAGTCTAATTTT,AAAATTTGACTTTGAAT,AAAATTAGACTTTGAAT
CTGCCATGAAGTGCTAG,TGCCATGAAATGCTAG,CTAGCACTTCATGGCAG,CTAGCATTTCATGGCA
ATTTGCGTCTTCTCCC,TTGCGTCCTCTCCC,GGGAGAAGACGCAAAT,GGGAGAGGACGCAA
CGATCTGGACCAGGCT,TGATCTGGACCAGGCT,CGATTTGGACCAGGCT,TGATTTGGACCAGGCT,AGCCTGGTCCAGATCG,AGCCTGGTCCAGATCA,AGCCTGGTCCAAATCG,AGCCTGGTCCAAATCA
CAGAGCATGTGCTG,CAGAGCGTGTGCTG,CAGCACATGCTCTG,CAGCACACGCTCTG
ACAGAAGATTTTCGGCTGC,ACAGAAGATTTTCGACTGC,GCAGCCGAAAATCTTCTGT,GCAGTCGAAAATCTTCTGT
CAGTTTCACTTAATTTTAAAATG,TTTCACTTAATTTAAAAATG,CATTTTAAAATTAAGTGAAACTG,CATTTTTAAATTAAGTGAAA
ACTCACACTCGAGTGACT,ACTCACACTCAAGTGACT,AGTCACTCGAGTGTGAGT,AGTCACTTGAGTGTGAGT
AGAGATGCAAAGTGGAGTT,AGAGATGCAAAATGGAGTT,AACTCCACTTTGCATCTCT,AACTCCATTTTGCATCTCT
ACCGTAGCTGCACCTG,CGTAGCAGCACCTG,CAGGTGCAGCTACGGT,CAGGTGCTGCTACG
CAATGATTAATGATTAATCCTTC,TGATTAATGATTCATCCTTC,GAAGGATTAATCATTAATCATTG,GAAGGATGAATCATTAATCA
CCGCCACCTTGGCT,CGCCACATTGGCT,AGCCAAGGTGGCGG,AGCCAATGTGGCG
TTCATAATTGAACGATTTCA,CATAATTGAACAATTTCA,TGAAATCGTTCAATTATGAA,TGAAATTGTTCAATTATG
TCTGGATGGAACCGTTAG,CTGGATGGAGCCGTTAG,CTAACGGTTCCATCCAGA,CTAACGGCTCCATCCAG
CTGGAGCGTTTCTGTA,CTGGAGCGTGTCTGTA,TACAGAAACGCTCCAG,TACAGACACGCTCCAG
TGGGTCTCGAGCCTGTA,TGGGTCTCGATCCTGTA,TACAGGCTCGAGACCCA,TACAGGATCGAGACCCA
AACGGGCATGAACGACTT,AACTGGCATGAACGACTT,AACGGGCATGAATGACTT,AACTGGCATGAATGACTT,AAGTCGTTCATGCCCGTT,AAGTCGTTCATGCCAGTT,AAGTCATTCATGCCCGTT,AAGTCATTCATGCCAGTT
CCCCCATATTGCTG,CCCCACATTGCTG,CAGCAATATGGGGG,CAGCAATGTGGGG
CCCACTTCGCTGAAGT,CCCACTTCACTGAAGT,ACTTCAGCGAAGTGGG,ACTTCAGTGAAGTGGG
CAAGAGTGGCATAAAA,CAAGAGTGGAATAAAA,TTTTATGCCACTCTTG,TTTTATTCCACTCTTG
CTGTGGTTTGTGGCGTG,CTGTGGTTTGTAGCGTG,CACGCCACAAACCACAG,CACGCTACAAACCACAG
AACATAACGGACTCCC,TAGAACATAACTGACTCCC,GGGAGTCCGTTATGTT,GGGAGTCAGTTATGTTCTA
CTCTCCTGATCACTCTGT,CTCTCCTGATCCCTCTGT,ACAGAGTGATCAGGAGAG,ACAGAGGGATCAGGAGAG
ATTAAACGTCTGGA,ATTAAACGTATGGA,ATTAAATGTCTGGA,ATTAAATGTATGGA,TCCAGACGTTTAAT,TCCATACGTTTAAT,TCCAGACATTTAAT,TCCATACATTTAAT
CATCACAACGATGTGTG,CACATCACAACTATGTGTG,CACACATCGTTGTGATG,CACACATAGTTGTGATGTG
GGGCTTGGGGGCAT,GGGCTTAGGGGCAT,ATGCCCCCAAGCCC,ATGCCCCTAAGCCC
TTTAGACTTTGCTCTATAACAG,ACTTTGCTCCATAACAG,CTGTTATAGAGCAAAGTCTAAA,CTGTTATGGAGCAAAGT
TTTCTTGTAGGCGTCAGAG,TCTTGTAGGCATCAGAG,CTCTGACGCCTACAAGAAA,CTCTGATGCCTACAAGA
CTCCTCAGGTGGGC,CTCCTCAAGTGGGC,GCCCACCTGAGGAG,GCCCACTTGAGGAG
CGTCGCATTCAGC,CGTCGCGTTCAGC,GCTGAATGCGACG,GCTGAACGCGACG
CATGAGGCGTTCGGC,ATGAGGCATTCGGC,GCCGAACGCCTCATG,GCCGAATGCCTCAT
CAGGATAATAACAAACAAG,CAGGATAATAACGAACAAG,CTTGTTTGTTATTATCCTG,CTTGTTCGTTATTATCCTG
ATCGACCCTGTCATTAG,CGACCCTGTGATTAG,CTAATGACAGGGTCGAT,CTAATCACAGGGTCG
GCACCACTGGACCC,GCACCATTGGACCC,GGGTCCAGTGGTGC,GGGTCCAATGGTGC
GAGGCCCCAGATTC,GAGGCCACAGATTC,GAATCTGGGGCCTC,GAATCTGTGGCCTC
CTACGTAATGAACGTTAGCT,ACGTAATGAACATTAGCT,AGCTAACGTTCATTACGTAG,AGCTAATGTTCATTACGT
GAACTCGTCGTTGG,GAACTCATCGTTGG,CCAACGACGAGTTC,CCAACGATGAGTTC
TCACATCCAACTCAGTACT,CATCCAACGCAGTACT,AGTACTGAGTTGGATGTGA,AGTACTGCGTTGGATG
ATAGGAGAATTGGA,ATAGGATAATTGGA,TCCAATTCTCCTAT,TCCAATTATCCTAT
TACATATGACTAATGAAA,TACATACGACTAATGAAA,TTTCATTAGTCATATGTA,TTTCATTAGTCGTATGTA
TCTATGGTGTGATTCATT,TTCTATGGTGTAATTCATT,AATGAATCACACCATAGA,AATGAATTACACCATAGAA
CTGGAAGAAGGCCTC,CTGGAAAAAGGCCTC,GAGGCCTTCTTCCAG,GAGGCCTTTTTCCAG
TTTTTGTTCAAAAG,TTTTTGGTCAAAAG,CTTTTGAACAAAAA,CTTTTGACCAAAAA
TTTGCCAAAGAGTTCAGATAC,TTTGCCAAAGTGTTCAGATAC,GTATCTGAACTCTTTGGCAAA,GTATCTGAACACTTTGGCAAA
CTACCTACCTTAGTGCTC,CTACCTACCTCAGTGCTC,GAGCACTAAGGTAGGTAG,GAGCACTGAGGTAGGTAG
CAATGAAGTTAATTTAATTGG,CAATGAAGTTCATTTAATTGG,CCAATTAAATTAACTTCATTG,CCAATTAAATGAACTTCATTG
CATTTAAAATGGTAAAAATCA,CATTTAAAATTGTAAAAATCA,TGATTTTTACCATTTTAAATG,TGATTTTTACAATTTTAAATG
AGAGTTGAATGGC,AGTGTTGAATGGC,GCCATTCAACTCT,GCCATTCAACACT
GAAGGCCAAATAAAATTG,GAAGGCCGAATAAAATTG,CAATTTTATTTGGCCTTC,CAATTTTATTCGGCCTTC
ATTGCATACTCGAGTCATCCA,ATTGCATACTTGAGTCATCCA,TGGATGACTCGAGTATGCAAT,TGGATGACTCAAGTATGCAAT
TGAGTTTTCAAGGGGTT,TGAGTTTTCAGGGGGTT,AACCCCTTGAAAACTCA,AACCCCCTGAAAACTCA
GCTAGCAAACGTCGCCA,GCTAGCAAACATCGCCA,TGGCGACGTTTGCTAGC,TGGCGATGTTTGCTAGC
GGTGGAGGGAAAAAGCAGTG,GGTGGAGGGGAAAAGCAGTG,CACTGCTTTTTCCCTCCACC,CACTGCTTTTCCCCTCCACC
TGGTCTACTTTGTGC,TGGTCTACTTCGTGC,GCACAAAGTAGACCA,GCACGAAGTAGACCA
CAGGTTGTTGGTTGTT,CAGGTTGTTGTTTGTT,AACAACCAACAACCTG,AACAAACAACAACCTG
TCCCCACCAAAATTAAGCAAA,TCCCCACCAAGATTAAGCAAA,TTTGCTTAATTTTGGTGGGGA,TTTGCTTAATCTTGGTGGGGA
CCCTGGAGATCT,CTCTGGAGATCT,AGATCTCCAGGG,AGATCTCCAGAG
ATGTTACATGTA,ACGTTACATGTA,TACATGTAACAT,TACATGTAACGT
TTTTTGTGTCCGCCATGAATT,TTTTTGTGTCTGCCATGAATT,AATTCATGGCGGACACAAAAA,AATTCATGGCAGACACAAAAA
CAAAAGTCTGTATTTTCAAAA,CAAAAGTCTGCATTTTCAAAA,TTTTGAAAATACAGACTTTTG,TTTTGAAAATGCAGACTTTTG
ACACACACAAGAGACACCCAC,ACACACACAAAAGACACCCAC,GTGGGTGTCTCTTGTGTGTGT,GTGGGTGTCTTTTGTGTGTGT
AACATATGAGTTGTAATGCCC,AACATATGAGATGTAATGCCC,GGGCATTACAACTCATATGTT,GGGCATTACATCTCATATGTT
AAATAAACGCTGGGTCTAATT,AAATAAACGCCGGGTCTAATT,AATTAGACCCAGCGTTTATTT,AATTAGACCCGGCGTTTATTT
TCCCTTGTCTATGGTATATCT,TCCCTTGTCTCTGGTATATCT,AGATATACCATAGACAAGGGA,AGATATACCAGAGACAAGGGA
TAGCCTTAAGCGCTTCCTGCC,TAGCCTTAAGAGCTTCCTGCC,GGCAGGAAGCGCTTAAGGCTA,GGCAGGAAGCTCTTAAGGCTA
CTCTCTGCTTGCGTT,CTCTCTGCTTTCGTT,AACGCAAGCAGAGAG,AACGAAAGCAGAGAG
GCTATTAAAAGG,GTTATTAAAAGG,CCTTTTAATAGC,CCTTTTAATAAC
CCATTATCATTAT,CCCTTATCATTAT,ATAATGATAATGG,ATAATGATAAGGG
TCAAGTGTTTCCTTTATTTTG,TCAAGTGTTTACTTTATTTTG,CAAAATAAAGGAAACACTTGA,CAAAATAAAGTAAACACTTGA
GCCTGACTGGACAACCATTTG,GCCTGACTGGGCAACCATTTG,CAAATGGTTGTCCAGTCAGGC,CAAATGGTTGCCCAGTCAGGC
CCTTTGTCACCGCTCATCAGC,CCTTTGTCACTGCTCATCAGC,GCTGATGAGCGGTGACAAAGG,GCTGATGAGCAGTGACAAAGG
AATGCCATTTTGT,AAAGCCATTTTGT,ACAAAATGGCATT,ACAAAATGGCTTT
GGGCCTTCGGGGTGCCTGTCC,GGGCCTTCGGTGTGCCTGTCC,GGACAGGCACCCCGAAGGCCC,GGACAGGCACACCGAAGGCCC
CATGTCAGTGC,CATGTCACTGC,GCACTGACATG,GCAGTGACATG
TAACTTACAGTC,TAACTTACAGCC,GACTGTAAGTTA,GGCTGTAAGTTA
GGGAGAGGAGGCCTGTCTTTA,GGGAGAGGAGACCTGTCTTTA,TAAAGACAGGCCTCCTCTCCC,TAAAGACAGGTCTCCTCTCCC
TGTGTCTGAGA,TGTGTCCGAGA,TCTCAGACACA,TCTCGGACACA
GAAAACTCTGCCCTG,GAAAACTCTGTCCTG,CAGGGCAGAGTTTTC,CAGGACAGAGTTTTC
CCATATGTCGCTTGT,CCATATGTCGTTTGT,ACAAGCGACATATGG,ACAAACGACATATGG
CTGGCGGGGTCTGGG,CTGGCGGGGTATGGG,CCCAGACCCCGCCAG,CCCATACCCCGCCAG
GGAGTCAGATAC,GAAGTCAGATAC,GTATCTGACTCC,GTATCTGACTTC
TGCAAGTCCTTCAAAGGCTCA,TGCAAGTCCTCCAAAGGCTCA,TGAGCCTTTGAAGGACTTGCA,TGAGCCTTTGGAGGACTTGCA
CCAGTGAGATGCTGTGTTGCA,CCAGTGAGATACTGTGTTGCA,TGCAACACAGCATCTCACTGG,TGCAACACAGTATCTCACTGG
ACTGAAGGAATTTAAC,ACTGAAGGAAGTTAAC,GTTAAATTCCTTCAGT,GTTAACTTCCTTCAGT
TACAGTTTCCTGTCTGA,TACAGTTTCCAGTCTGA,TCAGACAGGAAACTGTA,TCAGACTGGAAACTGTA
AACGTGACACAAT,AACGTGACACGAT,ATTGTGTCACGTT,ATCGTGTCACGTT
CAACATTCCAGTCTGAAAC,CATTCCAGCCTGAAAC,GTTTCAGACTGGAATGTTG,GTTTCAGGCTGGAATG
GTGAACCAATCAAT,GTGAGCCAATCAAT,GTGAACTAATCAAT,GTGAGCTAATCAAT,ATTGATTGGTTCAC,ATTGATTGGCTCAC,ATTGATTAGTTCAC,ATTGATTAGCTCAC
GTCAAACCAACTTTGCCAAGG,GTCAAACCAATTTTGCCAAGG,CCTTGGCAAAGTTGGTTTGAC,CCTTGGCAAAATTGGTTTGAC
TCTCTAAAAAGGTACAGTATA,TCTCTAAAAAAGTACAGTATA,TATACTGTACCTTTTTAGAGA,TATACTGTACTTTTTTAGAGA
GTGACAAGGTAGGGGTTG,GTGACATGGTAGGGGTTG,CAACCCCTACCTTGTCAC,CAACCCCTACCATGTCAC
CACGGTTTACACTCCTATTA,ACGGTTTACACTCCAATTA,TAATAGGAGTGTAAACCGTG,TAATTGGAGTGTAAACCGT
CATCAACACAATCTGC,CATCAACACGATCTGC,GCAGATTGTGTTGATG,GCAGATCGTGTTGATG
AGGGTCTCATGCTCCCT,AGGGTCTCGTGCTCCCT,AGGGAGCATGAGACCCT,AGGGAGCACGAGACCCT
TCACAAATGTATCCTAAAGC,CACAAATGTATACTAAAGC,GCTTTAGGATACATTTGTGA,GCTTTAGTATACATTTGTG
CCAACGGCGACTTG,CCAACGCCGACTTG,CAAGTCGCCGTTGG,CAAGTCGGCGTTGG
TGCATGTAACAAATAACAT,TGCATGTAACATAACAT,ATGTTATTTGTTACATGCA,ATGTTATGTTACATGCA
GTCTCTGACGGTGTGCTTTC,GTCTCTGACTGTGTGCTTTC,GAAAGCACACCGTCAGAGAC,GAAAGCACACAGTCAGAGAC
GTACGGAAAAAACA,GTACGGGAAAAACA,TGTTTTTTCCGTAC,TGTTTTTCCCGTAC
GGTTACATCCCAAA,GGTTACACCCCAAA,GGTTACGTCCCAAA,GGTTACGCCCCAAA,TTTGGGATGTAACC,TTTGGGGTGTAACC,TTTGGGACGTAACC,TTTGGGGCGTAACC
TCGAACTCCGCTCCTAG,TCGAACTCCACTCCTAG,CTAGGAGCGGAGTTCGA,CTAGGAGTGGAGTTCGA
CCACTAAGGATTACGTTACG,CACTAAGGATTACATTACG,CGTAACGTAATCCTTAGTGG,CGTAATGTAATCCTTAGTG
TACAGATGTCATTTTAC,CTACAGATGTAATTTTAC,GTAAAATGACATCTGTA,GTAAAATTACATCTGTAG
TGGAATGGGTAAGGTGTA,TGGAATGTGTAAGGTGTA,TACACCTTACCCATTCCA,TACACCTTACACATTCCA
TCCTTCTCACGCTTCT,CTCCTTCTCATGCTTCT,AGAAGCGTGAGAAGGA,AGAAGCATGAGAAGGAG
CCCGCGGTGAGTAT,CCCGCTGTGAGTAT,ATACTCACCGCGGG,ATACTCACAGCGGG
CCTCCTGGGTATATCG,CTCCTGGGCATATCG,CGATATACCCAGGAGG,CGATATGCCCAGGAG
CATCTGGCAATGCCTT,CATCTGGCAGTGCCTT,AAGGCATTGCCAGATG,AAGGCACTGCCAGATG
GCATTTTAAAAATC,GCATTTAAAAAATC,GATTTTTAAAATGC,GATTTTTTAAATGC
CCGTGGTATTGTTTCAA,CCGTGGTATTCTTTCAA,TTGAAACAATACCACGG,TTGAAAGAATACCACGG
TGTATGACCTCTGACCTGT,TGTATGACCTCTAACCTGT,ACAGGTCAGAGGTCATACA,ACAGGTTAGAGGTCATACA
TCTCTGCTCATCTGTC,CTCTGCTCGTCTGTC,GACAGATGAGCAGAGA,GACAGACGAGCAGAG
ATCAAGCTGACGAACCA,CAAGCTGACAAACCA,TGGTTCGTCAGCTTGAT,TGGTTTGTCAGCTTG
TGACTCTCAGCATCTG,TGACTCTCAGCAACTG,TGACTCTCTGCATCTG,TGACTCTCTGCAACTG,CAGATGCTGAGAGTCA,CAGTTGCTGAGAGTCA,CAGATGCAGAGAGTCA,CAGTTGCAGAGAGTCA
AGCTCCATGCGGACT,AGCTCCACGCGGACT,AGTCCGCATGGAGCT,AGTCCGCGTGGAGCT
CTGGACCAGAACTCTGA,CTGGACCAGATCTCTGA,TCAGAGTTCTGGTCCAG,TCAGAGATCTGGTCCAG
AAGGTGCCTTCCCC,AAAGTGCCTTCCCC,AAGGTGGCTTCCCC,AAAGTGGCTTCCCC,GGGGAAGGCACCTT,GGGGAAGGCACTTT,GGGGAAGCCACCTT,GGGGAAGCCACTTT
CTCTTCGATGTCTAGACA,CTCTTCAATGTCTAGACA,TGTCTAGACATCGAAGAG,TGTCTAGACATTGAAGAG
GCACGATGCAGAAC,GCACGATGTAGAAC,GCACGAAGCAGAAC,GCACGAAGTAGAAC,GTTCTGCATCGTGC,GTTCTACATCGTGC,GTTCTGCTTCGTGC,GTTCTACTTCGTGC
GAGAGCCGAGCTTT,GAGAGCTGAGCTTT,AAAGCTCGGCTCTC,AAAGCTCAGCTCTC
CCACGTAGCGATCG,ACCACATAGCGATCG,CGATCGCTACGTGG,CGATCGCTATGTGGT
AAGTCAGCATCTTTCA,AGTCAGCGTCTTTCA,TGAAAGATGCTGACTT,TGAAAGACGCTGACT
ATGGAGGATTGTGGTTGT,ATGGAGGATTCTGGTTGT,ACAACCACAATCCTCCAT,ACAACCAGAATCCTCCAT
AAGAAGCATTTTTTGG,AAGAAGCATTTTTTGT,AAGAAGCATTTTTTTG,AAGAAGCATTTTTTTT,AAGAAGCATTTATTTT,CCAAAAAATGCTTCTT,ACAAAAAATGCTTCTT,CAAAAAAATGCTTCTT,AAAAAAAATGCTTCTT,AAAATAAATGCTTCTT
TATTGGTCAGGGAA,TATTGGCCAGGGAA,TTCCCTGACCAATA,TTCCCTGGCCAATA
CTCCCACAAACCC,TCCCAGAAACCC,GGGTTTGTGGGAG,GGGTTTCTGGGA
TCCGTTAGTTCATCCTGG,TCCGTTAGTTCCTCCTGG,CCAGGATGAACTAACGGA,CCAGGAGGAACTAACGGA
ACAGATCCATCCACCACT,AGATCCAGCCACCACT,AGTGGTGGATGGATCTGT,AGTGGTGGCTGGATCT
CTGGACGCCGTTACA,TGGACGCCATTACA,TGTAACGGCGTCCAG,TGTAATGGCGTCCA
AGTGCGAAGAACC,AGTGCAAAGAACC,GGTTCTTCGCACT,GGTTCTTTGCACT
AGCAATCCCACAGC,AGCAATCACACAGC,AGAAATCCCACAGC,AGAAATCACACAGC,AGCAATTCCACAGC,AGCAATTACACAGC,AGAAATTCCACAGC,AGAAATTACACAGC,GCTGTGGGATTGCT,GCTGTGTGATTGCT,GCTGTGGGATTTCT,GCTGTGTGATTTCT,GCTGTGGAATTGCT,GCTGTGTAATTGCT,GCTGTGGAATTTCT,GCTGTGTAATTTCT
ATGGCCCTTACACTATC,TGGCCCTTACGCTATC,GATAGTGTAAGGGCCAT,GATAGCGTAAGGGCCA
ACAGAGAGAAGTCCCAGGTG,AGAGAGAAGCCCCAGGTG,CACCTGGGACTTCTCTCTGT,CACCTGGGGCTTCTCTCT
CCAGATGAACAACTTCAC,CCAGATGAGCAACTTCAC,GTGAAGTTGTTCATCTGG,GTGAAGTTGCTCATCTGG
CAGCTGTCCAGTTCTG,CAGTTGTCCAGTTCTG,CAGAACTGGACAGCTG,CAGAACTGGACAACTG
CTAACCCGGACGAC,CTAACCCGGACAAC,CTAACCCGGAAGAC,CTAACCCGGAAAAC,CTAACCTGGACGAC,CTAACCTGGACAAC,CTAACCTGGAAGAC,CTAACCTGGAAAAC,GTCGTCCGGGTTAG,GTTGTCCGGGTTAG,GTCTTCCGGGTTAG,GTTTTCCGGGTTAG,GTCGTCCAGGTTAG,GTTGTCCAGGTTAG,GTCTTCCAGGTTAG,GTTTTCCAGGTTAG
CTGGGTCGGCGCT,TGGGTCGACGCTC,AGCGCCGACCCAG,GAGCGTCGACCCA
TAGTAGCCCCTACACCTC,TAGCCCCTGCACCTC,GAGGTGTAGGGGCTACTA,GAGGTGCAGGGGCTA
CTGGCTGTAAACGAAGA,TGGCTGTAAACAAAGA,TCTTCGTTTACAGCCAG,TCTTTGTTTACAGCCA
TAGCCGTCACCGAT,TAGCCGCCACCGAT,ATCGGTGACGGCTA,ATCGGTGGCGGCTA
CATCCTGCTGGACCC,CATCCTGTTGGACCC,GGGTCCAGCAGGATG,GGGTCCAACAGGATG
TGGAGGTGGAGGAG,TGGAGGGGGAGGAG,CTCCTCCACCTCCA,CTCCTCCCCCTCCA
CGACACCACTTACA,CGACACTACTTACA,TGTAAGTGGTGTCG,TGTAAGTAGTGTCG
CCTTCCCTCCTAGGGCAACGT,CCTTCCCTCCCAGGGCAACGT,ACGTTGCCCTAGGAGGGAAGG,ACGTTGCCCTGGGAGGGAAGG
CCTATGAAGTT,CCGATGAAGTT,AACTTCATAGG,AACTTCATCGG
TTCACGTACGGCCCAT,TTCACATACGGCCCAT,ATGGGCCGTACGTGAA,ATGGGCCGTATGTGAA
ACCCATGAATAAGGACGAGAG,ACCCATGAATGAGGACGAGAG,CTCTCGTCCTTATTCATGGGT,CTCTCGTCCTCATTCATGGGT
CATCTTAGCCTCTCTGACCCC,CATCTTAGCCCCTCTGACCCC,GGGGTCAGAGAGGCTAAGATG,GGGGTCAGAGGGGCTAAGATG
CCTGAGATTAGG,CCTGAGATTAAG,CCTAATCTCAGG,CTTAATCTCAGG
TGATCATATCTCGTTCAGT,TGATCATACCTCGTTCAGT,ACTGAACGAGATATGATCA,ACTGAACGAGGTATGATCA
TCATTTTTGCAGAGAGAGAAT,TCATTTTTGCGGAGAGAGAAT,ATTCTCTCTCTGCAAAAATGA,ATTCTCTCTCCGCAAAAATGA
AGCCAATTGTAGCCTTAGTGC,AGCCAATTGTTGCCTTAGTGC,GCACTAAGGCTACAATTGGCT,GCACTAAGGCAACAATTGGCT
GTTGGGAGCGTCCCAAAATGG,GTTGGGAGCGGCCCAAAATGG,CCATTTTGGGACGCTCCCAAC,CCATTTTGGGCCGCTCCCAAC
AGCCTCTTCCTCTCTG,AGCCTCTTCCCCTCTG,CAGAGAGGAAGAGGCT,CAGAGGGGAAGAGGCT
GACCTCAAGCAGTCAG,GACCTTAAGCAGTCAG,CTGACTGCTTGAGGTC,CTGACTGCTTAAGGTC
AGATGAACACCAACTGGCCGG,AGATGAACACTAACTGGCCGG,CCGGCCAGTTGGTGTTCATCT,CCGGCCAGTTAGTGTTCATCT
CCTGCACACATGTCAAACCG,CCTGCACACGTGTCAAACCG,CGGTTTGACATGTGTGCAGG,CGGTTTGACACGTGTGCAGG
GTGTGAAAGGGGAGAAGGGCT,GTGTGAAAGGAGAGAAGGGCT,AGCCCTTCTCCCCTTTCACAC,AGCCCTTCTCTCCTTTCACAC
AGTCTGTCGTTGT,AGGCTGTCGTTGT,ACAACGACAGACT,ACAACGACAGCCT
GCAAATCTCCGATGTAAAGT,GCAAATCTCTGATGTAAAGT,ACTTTACATCGGAGATTTGC,ACTTTACATCAGAGATTTGC
TATTCAAAAGGAGCAGTTCAT,TATTCAAAAGAAGCAGTTCAT,ATGAACTGCTCCTTTTGAATA,ATGAACTGCTTCTTTTGAATA
ATAGTACATGCAGCATTTGCA,ATAGTACATGTAGCATTTGCA,TGCAAATGCTGCATGTACTAT,TGCAAATGCTACATGTACTAT
CTCTTTAGTACAGACTGAACA,CTCTTTAGTATAGACTGAACA,TGTTCAGTCTGTACTAAAGAG,TGTTCAGTCTATACTAAAGAG
AGCCTATAGAGTCCATTCAGA,AGCCTATAGAGTCCATACAGA,TCTGAATGGACTCTATAGGCT,TCTGTATGGACTCTATAGGCT
AACAAGGCCAAATCAGGAAGT,AACAAGGCCATATCAGGAAGT,ACTTCCTGATTTGGCCTTGTT,ACTTCCTGATATGGCCTTGTT
TGCTGACTCATCATTTACT,TGCTGACTCATCTTTTACT,AGTAAATGATGAGTCAGCA,AGTAAAAGATGAGTCAGCA"
for i in $PROBES
do
cat ./merged_Ots356_hash.txt | /dfs/Omalley_Lab/dayan/software/seqkit grep -s -i -p $i | grep ">" | awk -F ";" '{sum += $3} END {print sum}' >> Ots356_probe_cts.txt
done
After running these scripts combine the results in a text editor with marker names.
counts_356 <- read_tsv("results/primer_probe_counts_356.txt")
#lets get rid of the sex marker
counts_356 %<>%
filter(marker != "Ots_SEXY3-1")
# for this step it is best to look at reads containing the primer alone, as a read-hog primer can waste reads even if it amplifies off-target regions.
ggplot(data = counts_356)+geom_histogram(aes(x = primer_count))+theme_classic()
Looks like there are a handful of read hogs in there, let’s examine those first.
Here we take an arbitrary cutoff (drawn from the read distribution above) of greater than 1,800,000 reads
counts_356$portion_primer_reads <- counts_356$primer_count/sum(counts_356$primer_count)
kable(arrange(counts_356[counts_356$primer_count>18e5,], desc(primer_count)), digits = 3 )
| marker | primer_count | probe_count | portion_primer_reads |
|---|---|---|---|
| Ots_u07-57.120 | 7041855 | 42686 | 0.075 |
| Ots17_1488679_C6 | 3460621 | 63501 | 0.037 |
| Ots_hsp27b-150 | 3145749 | 74967 | 0.034 |
| Ots17_1345774_C6 | 2546539 | 267874 | 0.027 |
| Ots_103122-180 | 2486138 | 37792 | 0.027 |
| Ots_myoD-364 | 2261326 | 101088 | 0.024 |
sum(counts_356[counts_356$primer_count>18e5,4])
## [1] 0.2232756
sum(counts_356[counts_356$primer_count>5e5,4])
## [1] 0.5203962
plot(log(counts_356$portion_primer_reads[1:350]), log(counts_351$portion_primer_reads), xlab = "log portion of primer reads in Ots356", ylab = "log portion of primer reads in Ots351")
We see here that the top six most observed primers (of 350) account for 22% of total reads, the top 42 get 52%. The top six are shared across the two dataset and there is very good agreement between datasets.
All of these demonstrate highly skewed primer/probe ratios, indicative of poor specificity for these primers. These are great candidates for removal or a change to the primer sequence to improve specificity.
There are, however, a few markers that demonstrate minor differences between runs, this could be infomrative if we find that they might interact with the new primers.let’s pull these out by fitting a linear regression and save them for later.
lm1 <- lm(log(counts_356$portion_primer_reads[1:350]) ~ log(counts_351$portion_primer_reads))
resids <- residuals(lm1)
lm_results <- data.frame(counts_356$marker[1:350], counts_356$portion_primer_reads[1:350], counts_351$portion_primer_reads, resids)
lm_results %>%
slice_max(resids, n =5)
Failure to amplify
While some low amplifying primers can be “rescued” by excluding read hogs from future runs, primers with zero or extremely low amplification may need to be redesigned, or the in silico sequence used may need to modified to accomodate unaccounted for variants in the primer region. Such markers might show low/zero primer counts, but large numbers of probe counts. Let’s look for these.
Below we plot the read distribution overall (blue) (estimated by primer counts), vs reads from markers with less than 10% of the median number of reads.
zero_primers_356 <- counts_356 %>%
filter(primer_count < 0.1* median(primer_count))
ggplot(data = zero_primers_356)+geom_histogram(aes(x = probe_count), fill = "red", alpha = 0.7)+geom_histogram(data = counts_356, aes(x = probe_count), fill = "blue", alpha = 0.5)+ theme_classic()
1 marker (Ots_108735-302) has very poor amplification (just 10k reads). It also has very few probe counts so it is unlikely it’s dropping out because of variation in the primer region.
First let’s look at the distribution of primer and probe counts alone.
Primers first (ignoring the read hogs).
# for this step it is best to look at reads containing the primer alone, as a read-hog primer can waste reads even if it amplifies off-target regions.
ggplot(data = counts_356)+geom_histogram(aes(x = primer_count))+theme_classic()+xlim(-1000, 1800000)
Probes next.
ggplot(data = counts_356)+geom_histogram(aes(x = probe_count),)+theme_classic()
ggplot(data = counts_356)+geom_histogram(aes(x = probe_count),)+theme_classic()+xlim(-1000, 500000)+ggtitle("probe count\ntruncated at 500k")
Ratios
Now let’s take a look at the ratio of those values.
# for this step it is best to look at reads containing the primer alone, as a read-hog primer can waste reads even if it amplifies off-target regions.
ggplot(data = counts_356)+geom_histogram(aes(x = log10( primer_count/probe_count)))+theme_classic()+geom_vline(aes(xintercept = 0.69897), color = "blue")+geom_vline(aes(xintercept = -0.69897), color = "red")
Most primers and probes have sufficient specificity (i.e. primer:Probe ratio between 1:5 and 5:1). Most primer:probe ratios are skewed towards MORE primers than probes. Only a handful are strongly skewed (ratio greater than 5x) and one is strongly skewed in the opposite direction (less than 1/5). Positive skewed primer:probe ratios can be due to several effects:
Let’s look more closely by seeing if this is skewed towards high amplifiers.
ggplot(data = counts_356)+geom_point(aes((probe_count), (primer_count)), alpha = 0.5)+geom_abline(aes(slope = 1, intercept = 0))+xlim(4,7)+ylim(4,7)+scale_x_log10()+scale_y_log10()+theme_classic()
No Just like with the Ots351 data, It looks like the high primer:probe ratio markers (i.e. strong positive residuals from the 1:1 line in the plot above) occur across the distrubution of primer counts, suggesting these primers are could be either amplifying off-target sites or that there are bad probes/probe region variation. This warrants looking into more closely, but there’s not much more we can do with primer and probe counts alone.
high_ratio_markers_356 <- counts_356[which(counts_356$primer_count/counts_356$probe_count >5),1]
low_ratio_markers_356 <- counts_356[which(counts_356$primer_count/counts_356$probe_count < 0.2),1]
Let’s also look just at the new markers and make sure they check out
kable(counts_356 %>%
filter(str_detect(marker, "Ots37124")))
| marker | primer_count | probe_count | portion_primer_reads |
|---|---|---|---|
| Ots37124-12267397 | 107096 | 108136 | 0.0011418 |
| Ots37124-12272852 | 213437 | 143019 | 0.0022756 |
| Ots37124-12277401 | 266595 | 261957 | 0.0028423 |
| Ots37124-12281207 | 330477 | 306201 | 0.0035234 |
| Ots37124-12310649 | 183269 | 175688 | 0.0019539 |
These look near perfect!!! Near median depth, very good primer:probe ratios.
note used the allele correction values shared in the OtsGtseqv6.1 panel info shared from CRITFC
#!/bin/bash
#$ -S /bin/bash
#$ -t 96-194
#$ -tc 20
#$ -N GTseq-genotyperv3
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err
export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/'
FASTQS=(`ls -1 ../../raw_reads/Ots356/*fastq`)
INFILE=${FASTQS[$SGE_TASK_ID -1]}
OUTFILE=$(basename ${INFILE%.fastq}.genos)
GTSEQ_GENO="/dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_Genotyper_v3.1.pl
"
PROBE_SEQS="/dfs/Omalley_Lab/dayan/coastal_chinook_2020/75bp_panel_qc/genotyping/Ots356/Ots356_probeseqs.csv"
perl $GTSEQ_GENO $PROBE_SEQS $INFILE > $OUTFILE
#save this code chunk as a file on the server and submit this with qsub -q harold scriptname from the directory you want the output .genos files
SGE_Batch -q harold -r compile -c 'perl /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_GenoCompile_v3.pl > Ots356_coastal_chinook_run018_genos_0.1.csv'
#also collect marker info
touch marker_info_356.txt
for file in ./*genos
do
awk ' FS="," {print FILENAME,$1,$2,$3,$6,$7,$8}' $file >> marker_info_356.txt
done
Normally we would do some filtering at this step, but we want to look at ALL loci, so moving forward with no filtering.
gt_356 <- read_csv("results/Ots356_coastal_chinook_run018_genos_0.1.csv")
marker_info356 <- read_tsv("results/marker_info_356.txt")
marker_info356$a1_count <- as.numeric(marker_info356$a1_count)
marker_info356$a2_count <- as.numeric(marker_info356$a2_count)
marker_info356$total_count <- marker_info356$a1_count+marker_info356$a2_count
marker_info356 %>%
summarise(mean=mean(total_count, na.rm = TRUE), median=median(total_count, na.rm = TRUE), sd=sd(total_count, na.rm = TRUE))
gt %<>%
filter(str_detect(Sample, "negative") == FALSE) %>%
filter(str_detect(Sample, "positive") == FALSE)
ggplot(data = marker_info356)+geom_density(aes(x = (total_count)), fill = "#00204DFF", alpha = 0.5)+theme_classic()+geom_vline(aes(xintercept = median(marker_info356$total_count, na.rm = TRUE)))+xlab("depth per locus per individual\nmedian depth = 460 (vertical line)")
ggplot(data = marker_info356)+geom_density(aes(x = (total_count)), fill = "#00204DFF", alpha = 0.5)+theme_classic()+geom_vline(aes(xintercept = median(marker_info356$total_count, na.rm = TRUE)))+xlab("depth per locus per individual\nmedian depth = 460 (vertical line)")+xlim(0,2000)+ggtitle("truncated at depth of 2k per individual per locus")
## Warning: Removed 7471 rows containing non-finite values (stat_density).
sum(gt_356$`On-Target Reads`)/sum(gt_356$`Raw Reads`)
## [1] 0.4573618
Very similar to other plates. 46% of reads assigned to individuals were on-target.
Let’s also compare these results to the previous plates
First let’s compare sample genotyping stats
gt %<>%
mutate(sample_simple = str_extract(Sample, "[:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}"))
gt_356 %<>%
mutate(sample_simple = str_extract(Sample, "[:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}"))
gt_combined <- left_join(select(gt, sample_simple, `%On-Target`), select(gt_356, sample_simple, `%On-Target`), by = c("sample_simple" = "sample_simple"))
ggplot(gt_combined)+geom_point(aes(x = `%On-Target.x`, `%On-Target.y`))+geom_abline(aes(slope = 1, intercept = 0))+theme_bw()
Ok, pretty good agreement in genotyping success across different plates. This suggests we can make meaningful comparisons across Ots351 and Ots356 datasets (as we do above). Also something interesting here I’ve never noticed before, there is a lot of variation in %on-target within a dataset, given that this is consistent across two library preps, this suggest a lot of off-target reads is due to tissue or DNA quality and not anything to do with the pipeline or genomic variation.
Now, more interestingly, let’s compare marker level stats. This will be the first indication if adding the new markers led to any issues. note actually let’s just leave this to the GTseq Seqtest section above, this will require A LOT of data wrangling to get less meaningful results than can be provided by those data
Here we plot allele counts for all loci and examine if any markers have a skewed allele count ratio for heterozygotes
library(lme4)
hets <- filter(marker_info356, called_geno == "HET" | is.na(called_geno))
models <- hets %>%
group_by(marker) %>%
group_map(~ lm(a1_count ~ a2_count, data= .))
# Apply coef to each model and return a list of allele count ratios
lms <- lapply(models, coef)
ggplot()+geom_histogram(aes(x = sapply(lms,`[`,2)))+theme_classic()+ggtitle("allele ratios for all NA and HET calls")
#list of p-values
lms_anova <- lapply(models, summary)
# collect info about each bad model
paralog_possible_356 <- which(abs(sapply(lms,`[`,2)) > 1.5) #bad because a positively skewed allele ratio
paralog_possible2_356 <- which(abs(sapply(lms,`[`,2)) < (2/3)) # bad because a negative skewed allele ratio
paralog_possible3_356 <- which(sapply(lms_anova, function(x) x$coefficients[,4][2])> 0.01) # bad because too much variance in allele ratio, even if mean ratio is 1
paralog_possible_356 <- c(paralog_possible_356, paralog_possible2_356, paralog_possible3_356)
15 markers have skewed allele ratios (greater than 1.5), or high variance in allele ratios at heterozygote GTs, indicative of a possible paralogs. Interestingly only 11 of the 15 are shared with those from the Ots351 dataset.
Let’s take a look at these.
plots_356 <- marker_info356 %>%
group_by(marker) %>%
do(plots=ggplot(data=.)+geom_point(aes(a1_count, a2_count, color = called_geno))+theme_classic()+geom_abline(aes(slope=1, intercept=0))+geom_abline(aes(slope = 10, intercept=0), color = "green")+geom_abline(aes(slope = 0.1, intercept=0), color = "red")+geom_abline(aes(slope = 0.2, intercept=0), color = "blue")+geom_abline(aes(slope = 5, intercept=0), color = "blue")+coord_equal(ratio=1)+geom_abline(slope = -1, intercept = 10)+ggtitle(unique(.$marker)))
# we did this manually by iterating the command plots$plots[[paralog_possible_356[X]]] for x = 1:length(paralog_possible) and flagging any problematic markers, which we examine in detail later
10 of these look like they have issues. They are plotted below.
a <- plots$plots[[paralog_possible[1]]]
b <- plots$plots[[paralog_possible[2]]]
c <- plots$plots[[paralog_possible[4]]]
d <- plots$plots[[paralog_possible[5]]]
e <- plots$plots[[paralog_possible[6]]]
f <- plots$plots[[paralog_possible[7]]]
g <- plots$plots[[paralog_possible[10]]]
h <- plots$plots[[paralog_possible[11]]]
i <- plots$plots[[paralog_possible[14]]]
j <- plots$plots[[paralog_possible[15]]]
cowplot::plot_grid(a,b,c,d,e,f,g,h,i,j , ncol = 2 )
rm(a,b,c,d,e,f,g,h,i,j)
A note here for later: we could apply the approach in the “allele dropout” section with the approach I took earlier in the paralog investigation notebook to very quickly pull reads from markers with multiple allele ratios among heterozygotes and check if variatns are responsible for the behavior. this would alleviate some of the problems with the read mapping based approach I used there.
Checking if allele dropout is consistent across datasets is very useful because it can be difficult to parse a null allele from a poor genotyping due to chance alone.
marker_info356$total_count <- marker_info356$a1_count+marker_info356$a2_count
#lets get rid of the negative and positive controls for this step
marker_info356 %<>%
filter(str_detect(ind, "negative") == FALSE) %>%
filter(str_detect(ind, "positive") == FALSE)
geno_counts356 <- marker_info356 %>%
select(total_count, marker, ind) %>%
pivot_wider(id_cols = ind, names_from = marker, values_from = total_count)
# now lets normalize the read counts by the total depth at each individual to better identify locus level outliers
ind_depth356 <- rowSums(select(geno_counts356, -ind))
normalized_geno_counts356 <- geno_counts356 %>%
select(-ind) %>%
mutate_all(~ ./ind_depth)*100
#hist(unlist(normalized_geno_counts[,-1]), breaks = 500, main = "normalized % total reads per locus per individual")
#hist(unlist(normalized_geno_counts[,-1]), xlim = c(0,1), breaks = 500, main = "normalized % total reads per locus per individual\nreadhogs excluded")
# but these zeros/low scores could also be from bad markers across ALL samples, let's normalize across columns too
normalized2_356 <- as.data.frame(cbind(normalized_geno_counts356[,1], (scale(normalized_geno_counts356[,-1], center=FALSE, scale=colSums(normalized_geno_counts356[,-1])))))
hist(table(which(normalized2_356 < 0.001, arr.ind = TRUE)[,2]), main= "number of individuals within a marker\ndemonstrating less than 0.1% normalized read depth")
hist(table(which(normalized2_356 == 0, arr.ind = TRUE)[,2]), main= "number of individuals within a marker\ndemonstrating 0 read depth")
# okay but now lets see if this just from extremely low coverage individuals
bad_depths356 <- ind_depth[which(normalized2_356 == 0, arr.ind = TRUE)[,1]]
ggplot()+geom_histogram(aes(x = ind_depth), color = "red", fill = "red", alpha = 0.2)+geom_histogram(aes(x = bad_depths356), color = "blue", fill = "blue", alpha = 0.2)+theme_classic()+ggtitle("depths of individuals contributing to zero read GTs (blue), vs all depths (red)")
bad_depths2_356 <- ind_depth[which(normalized2_356 < 0.001, arr.ind = TRUE)[,1]]
ggplot()+geom_histogram(aes(x = ind_depth), color = "red", fill = "red", alpha = 0.2)+geom_histogram(aes(x = bad_depths2_356), color = "blue", fill = "blue", alpha = 0.2)+theme_classic()+ggtitle("depths of individuals contributing to low read read GTs (blue), vs all depths (red)")
Yes, there are some markers here that sequence well overall, but sequence poorly in a subset of individuals that otherwise sequence well. This suggests that there is some unnaccounted for variation. Are these by chance alone? Let’s see if some particular markers are over represented, suggesting that there is indeed a variant out there. The table below show the number of individuals (out of 187) where the normalized number of reads per marker and individual is less than 0.1% of the mean.
bad_table356 <- as_tibble(which(normalized2_356 < 0.001, arr.ind = TRUE))
colnames(bad_table356) <- c("ind_index", "marker_index")
bad_counts356 <- bad_table356 %>%
mutate(marker = colnames(normalized2_356)[marker_index]) %>%
count(marker, sort= TRUE)
bad_counts356
Pretty good overlap between this run and previous (Ots351)
Do any of the markers with a lot of dropout individuals also demonstrate strange paralog behavior?
plots[[1]][paralog_possible_356][plots_356[[1]][paralog_possible_356] %in% bad_counts356$marker]
## [1] "Ots_crRAD24807-74" "Ots_MHC2" "Ots_110689-218"
## [4] "Ots_Cath_D141" "Ots_OTSMTA-SNP1" "Ots_u07-57.120"
## [7] "Ots_wenYhap_25067_92"
Yes, 7 (above) out of the 15 markers also show some evidence of being a paralog.
Check for variants in primer or probe regions
geno_counts356[geno_counts356$`Ots_Est1363`== 0, 1]
geno_counts[geno_counts$`Ots_Est1363`<= 10, 1]
geno_counts[geno_counts$`Ots28_11160599`<= 15, 1]
geno_counts[geno_counts$`Ots_128302-57`<= 0, 1]
#now pull the corresponding sequence data out of these samples and check for variation
#for each questiionable marker concatenate fastqs all the individuals with allele dropout
#then count the number of reads from these individuals using exact matching using bbduk (remember to set kmer size to primer or probe length)
# now do the same process but allow for edit distance of 2, are there more reads that make it through?
############# Ots_105401-325
# list of files with zero reads at this marker
cat OtsAC20SILR_0005_AATGTC_GCAGAT_S149_R1_001.fastq OtsAC20SILR_0028_AATGTC_CTGGTT_S172_R1_001.fastq OtsAC20SILR_0042_AATGTC_ATGTTG_S186_R1_001.fastq OtsAC20SILR_0052_ACCGGA_AACGTT_S297_R1_001.fastq OtsAC20SILR_0080_ACCGGA_GAGAGA_S325_R1_001.fastq OtsAC20SILR_0093_ACCGGA_AGTAGG_S338_R1_001.fastq OtsAC20SILR_0119_ACCGGA_CTGGTT_S364_R1_001.fastq OtsAC20SILR_0127_ACCGGA_CTTATG_S372_R1_001.fastq OtsAC20TRAR_0028_AATGTC_CGGTCC_S124_R1_001.fastq > ../Ots_105401-325_inds.356.fastq
#exact matches
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_105401-325_inds.356.fastq outm=Ots_105401-325_inds.primer.fq literal=GAACTGAGCGGCTGCTG k=17 edist=0
#4822 reads
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_105401-325_inds.primer.fq outm=Ots_105401-325_double_match.fq literal=CCCGGACAAGATGAGACAG,CCCGGACAAGATGAGACCG k=18 edist=0
#0 reads
# 0 reads from above
#mismatches
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_105401-325_inds.356.fastq outm=Ots_105401-325_inds.primer.mm.fq literal=GAACTGAGCGGCTGCTG k=17 edist=2
#4938
/local/cluster/bbmap/bbduk.sh -Xmx33M in=Ots_105401-325_inds.primer.mm.fq outm=Ots_105401-325_double_match.mm.fq literal=CCCGGACAAGATGAGACAG,CCCGGACAAGATGAGACCG k=19 edist=2
#0 reads
cat Ots_105401-325_double_match.mm.fq | sed -n '2~4p' | sort | uniq -c | sort -n -r
Created the input file for the GTseq_Primer-Interaction-Test, by pulling the forward and reverse primer sequences from OtsGtseqv6.1 panel info file and also “Rosa primer sequences.” Trimmed the latter to have only the marker specific region in the file.
SGE_Batch -q harold -c "perl /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_Primer-Interaction-Test.pl primer_interaction_test_input.txt merged_Ots356_hash.txt > primer_interaction_results.txt" -r interaction
Here we’ll examine the primer interaction results. In particular for each primer pair, we will consider the proportion of reads that exhibit evidence of primer interaction for each marker examined
#first lets look at only double complement, fwd-rev, and fwd-fwd interactions
#this will allow us to scale the number of interactions by the fwd primer counts in the hash file and get a good sense of how bad the problem is (with the cost of skipping some rev-rev primer and rev-fwd interactions)
primer_ints <- read_tsv("results/primer_interaction_for_r.txt")
primer_ints_fwd <- primer_ints %>%
filter(type != "rf" & type != "rr")
#get reads contriuting to marker in genotyping pipeline
reads_per_marker <- marker_info356 %>%
group_by(marker) %>%
summarise(total_reads = sum(total_count, na.rm = TRUE))
primer_ints_fwd %<>%
mutate(marker1 = str_replace(marker1, "Ots37124_", "Ots37124-")) %>%
mutate(marker2 = str_replace(marker2, "Ots37124_", "Ots37124-")) %>%# some markers notated differently...
left_join(select(counts_356, marker, primer_count, probe_count), by = c("marker1" = "marker")) %>%
left_join(reads_per_marker, by = c("marker1" = "marker")) %>%
mutate(proportion = count/primer_count)
# some have more interaction counts than primer counts (how?)
ggplot(data = primer_ints_fwd)+geom_histogram(aes(x = proportion))+scale_y_log10()+theme_bw()+geom_vline(aes(xintercept = 0.1), color = "red")
In 39 markers (identified by the presence of the FWD primer), greater than 10% of the reads contain primer sequence from another marker. 22 markers have more than half of their reads derived from possible primer-primer interactions.
Let’s look in more detail at Ots3_34894254. This marker had 92703 FWD primer counts, 96256 probe counts and 88638 both (reads that contribute to genotyping). However, every single one of the 92703 reads with the FWD primer contained the reverse complement of the Ots7_50997124 reverse primer. The alignment between these primers is shown n code chunk below.
EMBOSS_001 1 ---------TGATATATTTTGCTGCAATGATCTG 25
...|.||||||||||
EMBOSS_001 1 CATACACCACACTGTATTTTGCTG---------- 24
# below is bash code
#exact matches
/local/cluster/bbmap/bbduk.sh -Xmx33M in=merged_Ots356.fasta outm=Ots3_34894254.primer.fq literal=TGATATATTTTGCTGCAATGATCTG k=25 edist=0
cat Ots3_34894254.primer.fq | sed -n '2~4p' | sort | uniq -c | sort -n -r
I pulled these reads and looked at them directly. Nearly all had a perfect matching probe, as expected. Interestingly only 10bp of the 3’ end of the off-target rev primer rev complement sequence aligned with 5’ end (see alignment below). I think the primer interaction test only looks for 10bp long kmers…cheked the code, yes it only looks for the presence of the final 10mer, so this marker looks bad, but in fact it’s fine. Developing an alternative search strategy should be pretty straightforward as the ideal search program has already been implemented in bbduk (ktrimr).
CATACACCACACTGTATTTTGCTG
*****||||||||||
TGATATATTTTGCTGCAATGATCTGATCACACCATGCAA...probe...
# same as above but for Ots33_19359879 and Ots_104063-132
/local/cluster/bbmap/bbduk.sh -Xmx33M in=merged_Ots356.fasta outm=Ots33_19359879.primer.fq literal=AGCGCCTGTTTTACATAAACACTT k=24 edist=0
cat Ots33_19359879.primer.fq | sed -n '2~4p' | sort | uniq -c | sort -n -r
(1) TGACTTCAACATCGTCCTTCATAATTAAATAAAC
**********||||||||||
(2) AGCGCCTGTTTTACATAAACACTTGTTTTAAATAAACGCCGGGTCTAATTGAAACGTT
(3) AGCGCCTGTTTTACATAAACACTTGTTTTAAATAAACGCTGGGTCTAATTGAAACGTT
(4) AAATAAACGCCGGGTCTAATT
(5) AAATAAACGCTGGGTCTAATT
(1) rev complement of the reverse primer from Ots_104063-132
(2) most common read (73k) with perfect match to primer from Ots33_19359879
(3) second most common read (21k) with perfect match to primer from Ots33_19359879
(4) probe 1
(5) probe 2
But how big of a problem is this? How many reads in the full dataset are potentially from primer dimers?
<15% of reads are potentially from primer dimers The sum of all counts in the primer interaction test result is ~15 of the total reads in the dataset. However the real number of reads lost to primer dimers is possibly less because some reads may be represented more than once in the result, and as we have seen, because the script only looks for 10-mers which means many flagged reads might actualy be just fine.
Let’s look just at the very worst potential primer dimers.
Below are the top 100
kable(primer_ints %>%
mutate(proportion_of_total_reads = (count / 131348821)) %>%
slice_max(count, n =25) )
| marker1 | marker2 | count | type | proportion_of_total_reads |
|---|---|---|---|---|
| Ots_Thio | Ots18_3417174 | 2428817 | rf | 0.0184913 |
| Ots4_42378741 | Ots_crRAD36152-44 | 2004442 | rf | 0.0152604 |
| Ots_103122-180 | Ots_myoD-364 | 1828166 | ff | 0.0139184 |
| Ots_U5121-34 | Ots_117370-471 | 1714484 | rf | 0.0130529 |
| Ots_sept9-78 | Ots_117242-136 | 1262507 | rf | 0.0096119 |
| Ots_102801-308 | Ots37124_12310649 | 1188811 | rf | 0.0090508 |
| Ots_AldB1-122 | Ots_ppie-245 | 1084668 | rf | 0.0082579 |
| Ots_crRAD57687-34 | Ots_Ostm1 | 526094 | ff | 0.0040053 |
| Ots_crRAD23631-48 | Ots_myoD-364 | 376123 | ff | 0.0028635 |
| Ots_IsoT | Ots_101554-407 | 365216 | rf | 0.0027805 |
| Ots_crRAD21115-24 | Ots_myoD-364 | 314394 | ff | 0.0023936 |
| Ots_u202-161 | Ots_wenYhap_25067_92 | 247535 | rf | 0.0018846 |
| Ots_crRAD20376-66 | Ots18_3426299 | 232467 | fr | 0.0017698 |
| Ots_ETIF1A | Ots_113242-216 | 226923 | fr | 0.0017276 |
| Ots_crRAD12037-39 | Ots_u07-20.332 | 206592 | fr | 0.0015729 |
| Ots_myoD-364 | Ots_U2362-330 | 180544 | fr | 0.0013745 |
| Ots_myoD-364 | Ots_P53 | 165085 | fr | 0.0012568 |
| Ots_GTH2B-550 | Ots_unk7936-50 | 155230 | rf | 0.0011818 |
| Ots_110381-164 | Ots_nelfd-163 | 131927 | fr | 0.0010044 |
| Ots_118938-325 | Ots_105105-613 | 128949 | rf | 0.0009817 |
| Ots_94857-232R | Ots_110381-164 | 121512 | ff | 0.0009251 |
| Ots_Ots311-101x | Ots_crRAD60620-51 | 109611 | fr | 0.0008345 |
| Ots_SEXY3-1 | Ots28_11210919 | 102268 | fr | 0.0007786 |
| Ots33_19359879 | Ots_104063-132 | 99564 | fr | 0.0007580 |
| Ots_crRAD12037-39 | Ots_103122-180 | 98391 | rf | 0.0007491 |
Grabbing the relevant sequences and aligning for the top handful of these primers confirm that the interactions are possible, see code chunk below for some alignments
# Ots_Thio-REV Ots18_3417174-FWD: 1.9% of all reads coming off the sequencer
EMBOSS_001 1 AATACCAAAC-CATGCCACTAATACCT--
.||||| || ||.|.||||||||
EMBOSS_001 1 ----GCAAACTCA-GCAAGTAATACCTCA
# Ots4_42378741 Ots_crRAD36152-44: 1.5% of all reads coming off the sequencer
EMBOSS_001 1 CGTTAACGCTCACCTGCAC---- 19
|...|||||||||
EMBOSS_001 1 ------CCAGCACCTGCACTTTG 17
# Ots_103122-180-FWD and Ots_myoD-364-FWD: 1.4% of al readds coming off the sequencer
EMBOSS_001 1 -CAA--ACGCGCACTCACACACA-
|.| ||.|.|||.||||||||
EMBOSS_001 1 ACGATGACACACACACACACACAC
Interestingly, some primers seems to crop up over and over again in bad primer pairs, suggesting that primer specificity is very low. Let’s take a look at these
a <- primer_ints %>%
slice_max(count, n =100) %>%
group_by(type) %>%
count(marker1, sort = TRUE) %>%
filter(n > 1)
b <- primer_ints %>%
slice_max(count, n =100) %>%
group_by(type) %>%
count(marker2, sort = TRUE) %>%
filter(n > 1)
common_primers <- unique(a[a$marker1 %in% b$marker2,"marker1"])
primer_ints %>%
filter(marker1 %in% common_primers$marker1 | marker2 %in% common_primers$marker1) %>%
summarise(tot_prop = sum(count, na.rm = TRUE)/ 131348821)
primer_ints %>%
filter(marker1 == "Ots_myoD-364" | marker2 == "Ots_myoD-364") %>%
summarise(prop = sum(count)/131348821)
primer_ints %>%
filter(marker1 == "Ots_131906-141" | marker2 == "Ots_131906-141") %>%
summarise(prop = sum(count)/131348821)
primer_ints %>%
filter(marker1 == "Ots28_11077172" | marker2 == "Ots28_11077172") %>%
summarise(prop = sum(count)/131348821)
primer_ints %>%
filter(marker1 == "Ots_103122-180" | marker2 == "Ots_103122-180") %>%
summarise(prop = sum(count)/131348821)
Most of these markers with many hits in the top 100 pairs (rows) of the primer interaction results don’t represent much of a problem. However, the Ots_myoD-364-FWD primer alone is responsible for 2.4% of all reads in the dataset and forms heterodimers leading to greater than 50,000 reads each with 6 other primers. This makes sense because the final 10-mer used for searching is a 2bp tandem repeat (“CACACACACA”/“GTGTGTGTGT”)
Some other common primer that dimerize with a lot of different primers include Ots_131906-141-FWD, Ots_myoD-364-FWD, Ots28_11077172-REV, Ots_103122-180-FWD and Ots_crRAD12037-39-REV
low reads if a lot of primers are forming dimers, this has two costs (1) wasted reads on primer dimers, (2) low read counts for markers because effective priemr concentration is dropped. Let’s look to see if any markers that are high in the primer-interaction results have unusually low on-target reads.
top_50_ints <- primer_ints %>%
slice_max(count, n =50) %>%
pull(marker1, marker2) %>%
unique()
low_50_ont <- counts_351 %>%
slice_min(probe_count, n = 50) %>%
pull(marker)
top_50_ints[top_50_ints %in% low_50_ont]
## [1] "Ots_103122-180" "Ots_U5121-34" "Ots_131906-141"
Yes, some overlap here. Ignoring Ots_103122-180 (already removing this marker for other reasons) Ots_U5121-34-REV forms 1714484 possible dimer sequences with Ots_117370-471-FWD and yet only about 16k reads on target. Ots_131906-141-FWD forms many possible dimers and yet only 30K reads on target. However, there are 1269914 primer hits for this marker.
the primary motivation for this test was to see if the new RoSA markers produced primer-primer interaction leading to wasted sequencing resources. Let’s check just these markers.
primer_ints %>%
filter(str_detect(marker1, "Ots37124") == TRUE | str_detect(marker2, "Ots37124") == TRUE) %>%
slice_max(count, n = 10)
Only one possible interaction severe enough to warrant further investigation: the reverse primer from Ots_102801-308 might be producing a heterodimer with Ots37124_12310649 forward. About 1% of all reads contain a match to both Ots_102801-308-REV and reverse_complement(Ots37124_12310649-FWD) Let’s look at the alignment between these.
CCCAAAGA-TGCTTAACTGAAGATGTG-
.||| || |.|.|||||||||
----GAGAGTG---ACCAGAAGATGTGG
Seems possible… Let’s see if there’s any issue with these markers.
counts_356 %>%
filter(marker == "Ots37124-12310649" | marker == "Ots_102801-308" )
#"Ots_102801-308" %in% paralog_possible_356
#"Ots37124-12310649" %in% paralog_possible_356
Nope, pretty ordinary markers. It’s possible primer dimers are occuring, but it doesn’t seem to be cause any major issues with the on-target sequences.
Up to 15% of the reads in the dataset result from primer-primer interactions. No single primer interaction led to the majority of these reads. Many markers (22) have more than half of their reads derived from possible primer-primer interactions, however looking closely at a few example loci suggests this to be an artifact of using short kmers to search for primer artifacts. This issue may also lead to inflating of all results from the test script used in this section
A handful of primer interactions are particularly problematic: (1) Ots_myoD-364-FWD potentially forms dimers with many other markers’ primers and the last 10mer from this primer is found in 2.4% of all reads in the raw sequence data.
(2) Several primer-primer interaction produce large numbers of reads (up to 1.9% of all sequence data each).
(3) One of the new RoSA markers was excluded because of evidence that it interacted with Ots_103122-180, in previous tests, however, Ots_103122-180 seems to interact with many other primers including one primer-primer interaction (with Ots_103122-180) that consumes as much as 1.4% of all reads.
I suggest dropping Ots_103122-180 and Ots_myoD-364, but it may also be worth looking more closely with a more sensitive search algorithm
Very few difference in results across Ots351 and Ots356. Will talk in recommendations below as if they are the same (i.e. summary figures/results used for discussion may come for either one)
Ots333 <- read_csv("results/coastal_chinook_2021b_GTs_0.1.csv")
Ots333 %<>%
filter(str_detect(Sample, "Ots") == TRUE)
#sum(Ots333$`On-Target Reads`)/sum(Ots333$`Raw Reads`)
The panel looks like it works fairly well. 46% of reads assigned to individuals were on-target (exact match to primer and probe sequence in read), which is substantially better than the current O. mykiss panel (~30% on-target) or the previous Chinook panel (18% on target), although I’d caution that these estimates are from a single run for each and there may be variation in on target reads due to DNA quality or library prep.
There are some issues that might be worth further examination or discussion:
(1) Some read hogs (not worse than the current O. mykiss panel, chum panel or previous Chinook panel). The top 6 amplifying primers compose 23% of all observed primers in the sequence data, and the top 38 (~11%) compose 50% of all primers observed in the dataset. This is similar to the current O. mykiss panel.
The plot below shows a slightly different metric (number of on-target reads per marker) across different libraries and demonstrates that this distribution of reads across markers is typical for our GTseq panels.
#read in other chinook panel results
marker_info_ots333 <- read_tsv("results/marker_info_ots333.txt")
marker_info_ots333$a1_count <- as.numeric(substr(marker_info_ots333$a1_count, 3, nchar(marker_info_ots333$a1_count)))
marker_info_ots333$a2_count <- as.numeric(substr(marker_info_ots333$a2_count, 3, nchar(marker_info_ots333$a2_count)))
marker_info_ots333 %<>%
filter(str_detect(ind, "Ots")== TRUE) %>%
mutate(portion_reads = (a1_count+a2_count)/sum(.$a1_count, .$a2_count, na.rm = TRUE))
marker_info_omy <- read_tsv("results/marker_info_Omy.txt")
marker_info_omy %<>%
filter(str_detect(ind, "Omy")== TRUE) %>%
mutate(a1_count = as.numeric(a1_count)) %>%
mutate(a2_count = as.numeric(a2_count)) %>%
mutate(portion_reads = (a1_count+a2_count)/sum(.$a1_count, .$a2_count, na.rm = TRUE))
marker_info %<>%
mutate(portion_reads = (a1_count+a2_count)/sum(.$a1_count, .$a2_count, na.rm = TRUE))
big_marker_info <- bind_rows(marker_info, marker_info_omy, marker_info_ots333, .id = "panel")
grouped_df <- big_marker_info %>%
group_by(marker, panel) %>%
summarise(percent_panel_reads = sum(portion_reads))
ggplot(data = grouped_df)+geom_density(aes(percent_panel_reads, fill = panel), alpha = 0.5)+scale_fill_viridis_d(labels = c("Ots351_75bp", "Omy_390", "Ots_333"))+theme_classic()+xlab("portion of on-target reads per marker\n\nvertical line demonstrates portion per read\nif all markers were evenly represented among on-target reads")+geom_vline(aes(xintercept = (1/333)), color = "#44015480")+geom_vline(aes(xintercept = (1/390)), color = "#21908C80")+geom_vline(aes(xintercept = (1/350)), color = "#FDE72580")+theme(legend.position = c(0.7, 0.5))
Because the worst over-amplifiers seem to be suffering from low specificity (strongly skewed primer:probe ratios), we could add some bases to the primer sequences (the actual oligos, since in silico changes wouldn’t matter here). It doesn’t sound like this is a viable option since we want to retain compatibility with other labs. However the panel could also be improved by loading some primers at lower concentrations, but the extent to which this is successful is limited by the low probe counts. All but one of the worst over-amplifiers have probe counts too low to drop the concentration without risking losing the marker to insufficient depth (the table below shows some of the panel notes from CRITFC on the markers + the primer and probe count results, note that the median probe count is ~110k).
#median(counts_351$probe_count)
kable(panel_info %>%
left_join(counts_356, by = c("Assay" = "marker")) %>%
filter(primer_count > 13e5) %>%
select(Assay, Concentration, `Panel Origin`, `Presumed Type`, Comment, `Last Action`, primer_count, probe_count, portion_primer_reads))
| Assay | Concentration | Panel Origin | Presumed Type | Comment | Last Action | primer_count | probe_count | portion_primer_reads |
|---|---|---|---|---|---|---|---|---|
| Ots_u07-53.133 | 2X | GSI | Neutral | NA | Active Panel | 1622641 | 1655558 | 0.0172998 |
| Ots_myoD-364 | 2X | GSI | Neutral | Keep at Jons Request | Active Panel | 2261326 | 101088 | 0.0241091 |
| Ots17_1345774_C6 | NA | OtsGTSeqV6.0 | NA | Low OnTarget | Testing | 2546539 | 267874 | 0.0271499 |
| Ots17_1488679_C6 | NA | OtsGTSeqV6.0 | NA | Low OnTarget | Testing | 3460621 | 63501 | 0.0368954 |
| Ots17_1066109_C6 | NA | OtsGTSeqV6.0 | NA | Read Hog with abnormal and fixed plots | Testing | 1667083 | 817855 | 0.0177736 |
| Ots_hsp27b-150 | 1X | GSI | Neutral | NA | Active Panel | 3145749 | 74967 | 0.0335384 |
| Ots_u07-57.120 | .5X | GSI | Neutral | Keep at Jons Request | Active Panel | 7041855 | 42686 | 0.0750767 |
| Ots_103122-180 | .5X | PBT | Neutral | Keep at Jons Request | Active Panel | 2486138 | 37792 | 0.0265060 |
It looks like three of these are test markers in the CRITFC OtsGTseqv6.0 test panel, two (Ots_hsp27b-150 and Ots_u07-53.133) are normal GSI markers with nothing special going on, and three have previously been identified as problem markers by CRITFC but retained because they are important for PBT or GSI.
Four can have thir concentration dropped. My suggestion is to order the rest as single tubes and exclude them from ordinary runs. Rationale for each is below
Change concentration
Ots17_1345774_C6: Half the concentration. Sufficient on-target amplification to allow an ~50% reduction in read depth.
Ots17_1488679_C6: Quarter the concentration. While this marker does not appear to present sufficient probe counts to permit reducing primer concentration, 82% of samples contain a null allele caused by a single base indel. Accomodating this indel with a modified probe sequence increased the on target read counts to 556k, or about 5x times the median. Ots17_1066109_C6: Quarter the concentration. Sufficient on-target amplification to allow at least 4X reduction in read depth.
Ots_u07-53.133 1/4 X concentration (1/8 the current concentration - currently run at 2x concentration), near perfect primer:probe ratio and 16X more on target reads than median.
Exclude from ordinary panel, order as tube
Ots_myoD-364. Order as single tube, exclude from ordinary runs. Not only is this marker an overamplifier responsible for 2.4% of all reads with a perfect match to any primer sequence. It also readily forms dimers with at least 6 other primers due to the repeating dimer at the 3’ end of the FWD primer and is responsible for up to an additional 2.4% of reads from primer dimers.
Ots_103122-180 Same as Ots_myoD-364. Not only a read hog, also forms a lot of primer dimers. Also excluding this marker will allow us to add Ots37124-12270118 back to the panel.
Ots_u07-57.120 The worst read hog, responsible for >8% of detected complete primer sequences in the sequencing data. This one is a bit complicated. It appears that we can not drop the concentration anymore because probe count is too low, however, a null allele caused by a single SNP in the probe sequences leads to 11% of samples excluded from contributing to the on-target reads for this marker. This is insufficent to improve the probe counts enough to warrant lowering the concentration.
Ots_hsp27b-150: May want in the future for GSI. Order as single tube. Ots_u07-53.133
A handful of markers (at least 11) either need allele correction value adjustments (at least 4, three clusters of allele count ratios, but heterozygotes do not have 1:1 ratio) or face more serious problems with paralogues/high ploidy (at least 7, more than three clusters of allele count ratios).
A related issue is allele droput (null alleles due to a variant covering the region tagged by the primer or probe sequences used by gtseq pipeline ). Null alleles can lead to the complete loss of an individual (no reads) or miscalling of a heterozygote as a homozygote. Multiple lines of evidence point to null alleles at relative high frequency and across many markers causing issues with genotyping. Investigating a few of these in detail reveals that common SNPw (maf ~10%) are unaccounted for in the region tagged by the probe sequences, leading to the loss of these individuals from the dataset and substantial mis-genotyping at this marker due to allele dropout.
After our lab meeting regarding GTseq problems, it seems like the best path forward is to incorporate null alleles by modifying probe sequences where appropriate. We need to discuss this in greater detail before moving forward.
My working suggestion for now is to add more populations to the dataset and run similar approaches to identify null alleles and PSV (sample dropout, gtseq_genotyper fuzzy matching, skewed allele ratios/PSV) then, for each marker, attempt to identify the problem. Using some kind of threshold (e.g. null allele is within edit distance of 1 and occurs at >5% maf within any population), then incorporate the variant into the probe sequence by either shortening the probe or adding degenerate bases. We should also consider extending probe sequences to resolve any 4 cluster genotypes into diverged duplicates and resolve diverged duplicates into simple diploid loci with allele correction values or extended probe sequences.